Interviewing is a noisy prediction problem
I have done roughly 2,000 interviews in my life. When I started recruiting, I had so much confidence in my ability to assess people. Let me just throw a couple of algorithm questions at a candidate and then I’ll tell you if they are good or not!
Over time I’ve come to the (slightly disappointing) realization that knowing who’s going to be good at their job is an extremely hard problem. The correlation between who did really well in the interview process and who performs really well at work is really weak.
Confronted by this observation I’ve started thinking about this process as inherent noise reduction problem. Interviewing should be thought of as information gathering. You should consciously design the process to be the most predictive of future job performance. Given that you have limited time to measure, you need spend your time measuring things that have high signal-to-noise ratio and things that have low correlation with each other.
Interviewing as a prediction problem
Job interview scene from Devil Wears Prada (2006)
Let’s start by stating the problem. We’re trying to predict job performance from a series of measurements (interviews). Those measurements are noisy meaning that any individual measurement is not very predictive in itself, but hopefully all of them taken in aggregate can be predictive. We can also choose what we want to observe ahead of time, by coming up with an interview process where we think the aggregate judgement correlates the most with job performance. So we can choose to spend one hour on system design, one hour on algorithms, etc.
If we had complete data about candidates and their future job performance, it would look something like this
|Algorithms||System design||Grades||Github portfolio||Actual job performance|
There’s many simplifications of thinking about it this way. For instance, we don’t necessarily know the actual job performance until they already got hired and worked with us (and even then, that’s a hard problem to estimate, but a separate hard problem). But let’s stick to this way of thinking because it will become useful later. Note that it’s not just a table, it’s also a matrix! I will get back to this and do some magic linear algebra on it later.
Restating interviewing as a prediction problem might seem like a super obvious thing, but I think people often forget this point. I’ve read about 1,000 Hacker News comments complaining that interview questions about turning a binary tree upside down (or whatever) are stupid because no one would ever do that, or there’s already a library for it, or something else. I think that’s completely besides the point! The real question is: does solving a problem about turning a binary tree upside down predict future job performance?
Interviewing is tricky – you only have a few hours to collect measurements. I have no moral objection against asking “unrealistic” questions, as long as I have some reasonable expectation that they predict future job performance. In fact, I think any sort of interviewing is going to be unrealistic, and I’m very happy that there are some decent ways to evaluate the performance of engineers. I have no idea of how carpenters hire carpenters or how investment bankers hire investment bankers, but I wouldn’t be surprised to hear that their interview process they have is far less objective.
The depressing science on interviews
Let’s review research and see what measurements to pick that are most predictive. Turns out there’s a bit of research suggesting interesting things. One thing that almost all research seems to indicate is that overall, interviewing is a lot less predictive than we want it to be.
- Cortina, Goldstein et. al. (2000) find that highly structured interviews add some signal on top of testing cognitive ability + conscientiousness, but that unstructured interviews are almost useless.
- Highhouse (2008) finds that interviewers perceive of unstructured interviews as the most predictive, and intelligence tests as the least predictive, although the estimated actual effectiveness is actually the opposite.
- The jury is still out whether grades predict much: Google claims no and this blog post sees no relationship between grades and interview performance, whereas this paper claims that grades are more predictive of “important life outcomes than IQ”.
- Some recent news articles include Bloomberg calling interviews useless, NYT calling them utterly useless, and The Guardian calling them pointless. To be clear, they all attack the unstructured interview.
A few studies seem to at least call out intelligence as a decent predictor:
- Vinchur and Shlippmann (1998) performs a meta-analysis on sales data and find that “general cognitive” ability is the strongest predictor for job rating, although a weak one for sales (N = 1200)
- Gottfredson (1998) finds that IQ is a good predictor for life outcomes overall.
- Schmidt and Hunter (2004): General mental ability predicts both occupational level attained and performance within one’s chosen occupation and does so better than any other ability, trait, or disposition and better than job experience (here’s a summary of a lot of the work by the authors).
I’m not a scholar and I can’t claim to have done an extensive literature review, but here’s my biased summary:
- Intelligence tests seem to be the strongest predictors of job performance (although possibly offensive, and probably illegal in the US)
- … followed by structured interviews
- … followed by unstructured interviews (which have very little predictive power).
Picking the measurements that matter the most
Let’s focus for a second on what measurements matter the most. Since we can’t measure actual job performance, we need to measure other skills. Those measurements will be highly noisy. So we need to find measurements that (a) are decent proxies (b) have as high signal-to-noise ratio.
I don’t have a grand unified theory here. But I have a long list of ideas I think are helpful.
Bad interview signals
- Judging by the literature review earlier in this post, unstructured interviews are almost useless. Avoid them.
- Some interview questions spend 60 minutes focusing on a problem that relies on a single insight to solve. That’s extremely low signal-to-noise ratio! Try to avoid such problems.
- Likewise, avoid long interview questions depend on some kind of “trick” or “insight”. You get basically 1 bit of information out of this question.
- You want the highest signal-to-noise ratio, so you should design your interview to make the candidate talk as much as possible. Besides the time set aside for questions, the candidate should do 80-90% of the talking, not the interviewer. Every minute the interviewer spends talking is opportunity cost!
- However, it’s important the interviewer keeps driving the discussion. If the candidate is talking about something and it veers off into tangential territory, you need to bring it back asap. Again, opportunity cost!
- Certain signals like “having a great Github portfolio” can be great signals, but have tons of false negatives. Meaning that the presence of those signals can be great, but the absence isn’t necessarily a big deal. I’ve hired many amazing engineers that had absolutely nothing on their Github, and that’s fine!
- I’m a bit skeptical of live coding as an interview, mostly because it’s hard to do without forcing a certain environment or methodology onto a developer. As a result, you end up measuring mostly how familiar they are with the environment. But maybe it could be done well for certain tasks.
- I don’t think algorithms on a whiteboard are great, but mostly because they take so much time. Asking a candidate to implement Levenshtein distance can take up a full hour, and again, the opportunity cost matters. Quick algorithm questions (max 5-10 min) can be fine, although another concern I have with algorithm questions is that junior CS grads score well for no other reason than having their algo class fresh in mind.
- I’m not a big fan of take-home projects. They hurt people with families and commitments. It also tends to increase dropoff for the people that are in highest demand, which is exactly the people you don’t want to lose. I still do homework assignments for a few roles where I haven’t found a better interview format, but we limit them to 3-5 hours and we pay for them.
- Cultural interviews and using the “beer test” is great if you want to hire people for no other reason than they are similar to you. I don’t do them. If a candidate is a wacko, you’ll notice it anyway. I do take candidates out for lunch as a way for them to get to know us, but there’s no evaluation going on.
- I find questions like “what are your weaknesses” or “what is your superpower” to be silly. A good answer just means the candidate is verbal and can ramble on the spot. It favors verbal extroverted people without reflecting on their actual skills.
- I don’t think knowing a lot of obscure features of a particular language is super useful. Good engineers know a bunch of languages and can pick up new languages quickly. If you’re interviewing for a backend role, don’t spend 60 minutes probing into C++ template metaprogramming. The main exception to this rules is roles that are obviously tied to a particular tool (like CSS for frontend engineers).
Good interview signals
- While I don’t like long problems that rely on knowing a certain trick, I think it’s great to have many short interview questions that rely on knowing particular things. If you can go through 20 such problems in one single interview, you increase the signal-to-noise ratio a lot!
- I’ve experimented with many types of interviews, but one thing I really dig lately is code reading. I print out 10-15 code snippets (not just code, but also UNIX commands, regular expressions, SQL queries and many other things). I then go through and ask: what does this snippet do, how does it work, are there any bugs, etc.? Reading code (as opposed to writing) means I can cover a lot of ground extremely quickly, spending no more than a minute or two on each problem.
- Homework assignment can be OK, although I think it’s important to limit them to 3-5 hours, and definitely pay for them!
- Consulting agreements (with the intent to hire) are great for the people that have that flexibility. It gives an opt-out on both sides after a certain time period.
- I like system design questions (“how would you build this feature”). Anecdotally they seem to capture experience quite well.
- While I’m skeptical about going deep into a particular language’s features, I think it can be a good idea to cover many different languages and their features. Not everyone is proficient in every language, but decent engineers have at least one or two languages that they have a deeper knowledge of.
- I have had some decent success using online coding tests like Hackerrank. They skew way too algorithmic IMO, but so far they seem to be a fairly good predictor of future job performance. The main issue is that senior candidates often find it offensive to do these tests, so I usually reserve them for more junior candidates.
- I’m skeptical about trying to assess communication skills directly, but I do think there’s a certain “clarity of thought” that intelligent people exhibit.
- While the absense of cool projects on Github is not a big deal, the presence of projects is a very positive signal to me. Same goes with open source contributions, blog posts, or other things.
- There’s another 100 things I could list about how I evaluate a candidate’s resume, but I think in the interest of brevity I’ll leave it for another blog post!
- A final word on intelligence tests. Literature suggests that they can have decent predictive power. I think “brain teasers” can work as a proxy for intelligence tests. I’ve never used them during interviews but I’d imagine they might have a certain place, like maybe if you’re interviewing for a weird role where there’s literally nothing else that you can measure. I don’t think we should categorically throw them out.
Job interview scene from Office Space (1999)
Going back to the table of interview rankings and let’s introduce some math. Stay calm, the math is not harmful! If you are squeamish about math, feel free to skim.
Let’s assume all the entries in the matrix are known, and that we are trying to find the best linear combination that predicts the future job performance the best. One way you can phrase this problem is to say you want the ’s that minimize the quadratic error. This is a classic linear regression problem:
The matrix is the matrix with interview scores, i.e.
And the vector is the vector of actual job performance, i.e.
This isn’t a particularly hard problem from a mathematical point of view to solve (it even has a closed form solution). But it’s going to be super noisy! Why?
- We don’t have much complete data, not substantially more than the number of people you ever hired.
- Each interview is a super noisy measurement of the underlying ability.
- You only collected data for the handful of interview types that you have been focusing on. Everything else is unknown.
- There’s a tricky survivor bias where we only observe the for the people that we hire. This can lead to weird patterns due to something called Berkson’s paradox (like when Google noticed that programming competitions were negatively correlated with job performance).
So how can we combine signals more efficiently?
- We should try to pick interviews that are proxies that have little correlation with each other. If we do two interviews that have absolutely no correlation, but each of them is a proxy for actual job performance, then the aggregate error decreases.
- If you have no idea which interviews are predictive, at least it seems reasonable to assume that at least all , i.e. no interview is negatively correlated with future job performance. From that point of view, it makes sense to try to cover as many different topics as possible. A bunch of them might not say anything about future job performance, some of them do, you don’t know which ones so why not cover many different topics?
- There’s an inherent bias-variance tradeoff. It might be worth lowering the error a bit while introducing a bit more bias. This actually gets interesting when you think about the risk aversity that different companies may have.
- A large company like Google might actually afford to be a lot more risky. They can hire effectively for the expected value. For them, putting all their weight on a few things that correlate the most with job performance (intelligence test and various proxies for it, like algorithm puzzles).
- A small startup might be super risk averse. The cost of hiring the wrong person is super high because it can completely derail their momentum. So they might hire for some expected risk-adjusted function of the future job performance. This effectively ends up distributing the weights over more interviews. So rather than going all in on one or two interviews, it might be wise to focus on many different things. And I literally mean something like: spend 10 minutes on regular expressions, 10 minutes on HTML, 10 minutes on functional programming, 10 minutes on SQL, 10 minutes on AWS, etc. etc.
- This also reminds me of portfolio optimization. If you can tolerate more risk, invest everything in stocks. If you are risk averse, hedge your bets and diversify over many asset classes.
Humans are prone to confirmation bias. We subconsciously form an opinion about things, and let that influence our decision making. This is dangerous! You start to like a particular candidate a lot for whatever superficial reason, you drop your guard, start giving them a bit more hints or give them the benefit of the doubt in a way that some other candidate wouldn’t get.
The main remedy is to have a structured interview process. If everyone goes through (roughly) the same questions, your judgement becomes more objective. The benefit isn’t just that it’s more fair and you’re not perpetuating biases – you also end up hiring better talent.
There’s a separate trick I have for trying to make more neutral judgement is to ask myself: what would I have to see in order for me to change my mind about this candidate? If I start out super excited about a candidate, and they nail three questions in a row about system design, then I try to bring out devil’s advocate: maybe this person lacks something else? And I switch to some completely different topic, like regular expressions or UNIX commands. Conversely if someone doesn’t do well, think of a hypothetical question where they might win you back. Always try to poke holes in your own judgement.
Side note, but a fascinating study on objective interviews is how blind orchestra auditions became prevalent a few decades ago.
Sanity checking your interview process: would your best engineers do well?
Job interview scene from Trainspotting (1996)
I have mentioned a lot of interview formats that I think are better than others. On top of that, the most critical thing is that you learn from your interviews and continuously refine your interview set up. Over time, look at the people that you hired and ask yourself, what are the strongest predictors of future job performance?
I find this to be a powerful sanity check. Look at the best engineers in your company and try to picture them going through your interview process. Would they do extremely well in the process? Conversely, would the low performance engineers do worse? Or you can also look at the current top performers at your company, and ask yourself: what type of interview would have selected these people? If your interview process would have filtered out your best engineers, then you should probably reevaluate the process.
Some final thoughts
Interviewing is one of those things where every year I realize how much I sucked at it a year ago. I think my ability to predict future job performance has increased over time, but more importantly I’ve learned the hard way how hard it is. If I were to write this same blog post in a year, I’m sure it will look slightly different. As many other things, the most important thing is to keep learning and iterating.
Markus Frödin pointed out a well-written meta-analysis that revisits a lot of the job performance prediction literature and arrive at slightly different findings. Their main contribution is to correct for the “range restriction” problem – the bias you get from only measuring the people you hire. Their top predictor is intelligence tests, interestingly followed by “integrity tests”. Unstructured and structured interviews have fairly similar strength.
If I interpret the results correctly, maybe unstructured interviews are useful, but people put too much weight on them – that would cause the correlation between interview performance job performance in the hired group to be zero, even though it could be possible across the whole population. Alternatively, unstructured interviews might be a great way to turn down candidates, but maybe not discern among the “good enough” group. Anyway, my approach is always evolving!