How to Hire Data Scientists

Hiring is hard. Hiring data scientists (and MLEs) is harder. This post draws on my experience building major data teams on a couple of continents, and covers what I advise you actually look for when vetting Individual Contributors and Managers.

The Problem

Why is it so hard to hire data scientists? There are a few reasons.

  1. Lack of clarity on what a data scientist actually is and does
  2. Easy to bullshit (few non-data scientists can connect their work with outcomes vs outputs)
  3. Obfuscation
  4. Actually a basket of skills, rather than just one

What People Normally get Wrong

In the bad old days of everyone trying to do “big data”, you’d often see desperate CIOs or similar in large companies basically hire a bunch of PhDs and throw them in a corner with some vague direction to “do data science.”

Mostly, these affairs ended about as well as you imagine they would, and they often stained the term “data science” or “big data” in many organizations. More worryingly, most places that do not have strong data science processes and cultures still tend to hire on advanced quantitative degrees, or confuse data scientists with engineers who can apply or use algorithms.

Sometimes there is a high degree of overlap between these people and data scientists. You’d certainly hope so. But it’s important to be able to distinguish between every new statistics grad who confers a “data scientist” title on themselves but uses only Excel and PowerPoint, software engineers who can insert an algo like XGBoost and process data, fancy PhDs who can wax eloquent on Generative Adversarial Networks (GANs), and people who can actually deliver sustainable value, solve business problems, and base those on the foundations of good data science.

Data science is actually not one skill but a half dozen, which means you need composite, balanced individuals with broader skillsets. Optimizing solely on advanced quantitative or algorithmic hard skills or academic background leads to grossly unbalanced teams that cannot deliver good data science and have a deep tendency to focus on outputs over outcomes. It also leaves you unable to attract better people, since prospective recruits will suffer from imposter syndrome if they don’t have a PhD in quantum mechanics or can’t code CatBoost from first principles.

Fact is, most people with a decently strong quantitative background can become great data scientists. But most are taught by managers (hired similarly) to focus on the wrong things. And really, the thing that separates great data scientists from the mediocre and even the poor is not advanced math skills. Fact of the matter is, for the majority of business problems, simpler models work much better and more reliably than fancier ones, due to estimation errors and the limits on the precision and availability of well-labelled data.

What Data Scientists Do

Why do we even call these people data “scientists”? The term is ripe for abuse and overlap between Analysts, Machine Learning Engineers, and even Data Engineers. Ignoring actual titles for a minute, let’s talk about what they should be doing.

Data scientists

  1. apply the scientific method
  2. to data
  3. to test a hypothesis
  4. solving a business problem

Applying the scientific method is the science part of the scientist. Their job is not to “do data science”, “build a model”, or “do R&D”. It’s to deliver demonstrable impact and business value indicated by improvement on metrics the business actually cares about (and making sure changing those does not negatively impact other metrics the business cares about).

How? By following a good process that allows us to theorize cause and effect, and to build a model that helps the business make better decisions (either automated, in the case of machine systems, or in conjunction with people making those decisions, through analysis or the provision of tools that inform them).

What You Should Actually Look for

Always look for full stack data scientists. And by that, I mean people who have a decent background in statistics and understand mathematical underpinnings and techniques, can manipulate data (or better, code) in a programming language (and I highly recommend Python), and aspirationally, build a model that can return a result through an API. All these things are baseline from my perspective, but beyond baseline, this is what I believe you should optimize for:

Process and Method

One of the first questions I generally ask of almost every more senior data scientist or manager is about their process: “Walk me through how you’d approach problem X”. The worst responses to this question come from people who jump directly to a number of pat algorithmic approaches, often overly sophisticated and advanced, for how they think they can solve the problem, without asking any questions. Usually this just signals to me that the person in question is enamoured of a particular approach and slightly biased (with a high correlation to what they focused on in their grad degree).

These are bad answers. What I really want to hear about is their process for approaching problems. The best answers methodically walk me through how they would go about tackling the problem, often from the non-algorithmic side: how they would understand the problem, approach analysis, get or create the data they require, look at it, and understand it and its limits.

I’m in heaven if people discuss a method that looks a lot like the scientific method. I also like hearing the words “literature search” (you’d be shocked how many people do not consult prior art on their problem). In fact, managers or even directors I’m considering who leap directly to a solution because they might have used something similar before, and don’t ask questions about the particulars of the use case I’m giving, are often dangerous. If they are also unable to talk about their process (particularly in terms of how they guide and grow their team members), they usually get marked low in my scoring.

Problem Statements and Good Questions

Perhaps one of the most underrated skills great data scientists need is the ability to get a project or product down to a concise, clear, compact problem statement that describes a problem actually solvable by data science. This combines insightfulness, communication, and stakeholder wrangling, all of which are essential to a scientist’s success, particularly as they move into senior levels and management. And to do so, a data scientist generally needs to ask good questions.

Why? What does this mean? Data science usually goes horribly wrong for lack of a succinct problem statement. If your data scientist takes direction from stakeholders like “we need to build a collaborative filter”, or “optimize x”, or “we need a system to target users” without being able to get that down to an extremely compact problem statement about what problem specifically is being solved and how they will know it has been solved (as well as contraindications, meaning what else must not be adversely affected; a broad example: revenue needs to improve by x% with no decline in retention rate), you are eventually, if not immediately, going to go off the rails or deliver a product which is not fit for purpose.

Well-meaning but misguided stakeholders will often describe a problem much differently than the actual success criteria for a project. The project then ends up delivering on what the team thought the problem was, or even on how it was described to them, but ultimately does not solve the underlying problem and ends up being either sub-optimal or not adopted (e.g. “minimize incentives budget” often leads to a very different outcome than “find a way to reduce the incentives budget while maintaining retention rates”).
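To make the revenue-and-retention example concrete, here is a minimal sketch of a problem statement written down as a checkable success criterion: one target metric plus guardrail metrics that must not regress. The class name, field names, metric names, and every number below are hypothetical, chosen only for illustration, and the comparison is deliberately naive (a real check would sit on top of a proper experiment and a significance test).

```python
from dataclasses import dataclass, field


@dataclass
class ProblemStatement:
    """One target metric plus guardrails (all names and numbers are hypothetical)."""
    target_metric: str
    min_relative_uplift: float  # e.g. 0.05 means "+5% or better"
    # Guardrail metric -> maximum tolerated relative drop (0.0 = no decline allowed).
    guardrails: dict = field(default_factory=dict)

    def is_solved(self, baseline: dict, observed: dict) -> bool:
        """True only if the target improved enough AND no guardrail regressed."""
        uplift = observed[self.target_metric] / baseline[self.target_metric] - 1.0
        if uplift < self.min_relative_uplift:
            return False
        for metric, max_drop in self.guardrails.items():
            change = observed[metric] / baseline[metric] - 1.0
            if change < -max_drop:
                return False
        return True


# "Revenue needs to improve by 5% with no decline in retention rate."
statement = ProblemStatement(
    target_metric="revenue",
    min_relative_uplift=0.05,
    guardrails={"retention_rate": 0.0},
)

# Made-up numbers, purely to exercise the check.
baseline = {"revenue": 100_000.0, "retention_rate": 0.80}
good = {"revenue": 108_000.0, "retention_rate": 0.81}
bad = {"revenue": 112_000.0, "retention_rate": 0.74}

print(statement.is_solved(baseline, good))  # True: revenue up, retention held
print(statement.is_solved(baseline, bad))   # False: revenue up, but retention fell
```

The second case is the interesting one: it "optimizes x" better than the first, yet fails the problem statement because the guardrail was breached.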

Critical in this is also making sure you are solving the right problem and can demonstrate unambiguously that you have solved it. Often this has to do with after-effects: a product manager or business head may have changed something and it did not have the effect they wanted (and they’re actually asking you to prove it’s not a wash). Understanding the context is key. What were the assumptions? Why did they believe this would work? Did they just blindly develop something without testing the underlying customer assumption this feature might hinge upon?

And to do that, a good data scientist needs to ask questions, poke at assumptions, understand the underlying business mechanics, sniff for bias (especially, as this is one of the most insidious ways data science projects fail), synthesize a range of hypotheses, and get agreement from the stakeholders. A bunch of skills are often tied up in this, from communication, to wrangling, to scope control. But they all start with the ability to ask good questions, understand the real problem and what solved (or improved) looks like, and synthesize a proper problem statement to work around.

Impact

A data scientist who does not drive business impact is a researcher. It is important to know the difference.

You can right away sift out CVs from data scientists that speak only about what they did but not how it affected the business. Better yet, teach your talent acquisition team to do this for you before the CVs even get to you. People who can break this down to specific metrics in the “did x, causing y, by z” formulation are good to speak to. Exceptions can possibly be made for people who have only been in academia previously, but you have to vet whether they can make the jump to business. They’ll often have the quantitative skills; the key is determining whether they can have impact.

The problem here now is that many data scientists, particularly ones who do not have impact, have learned through various CV advice-giving sites that it is important for them to describe how their work impacted the business with the above formula. So, it is important to drill down on the claims people make in their CVs around those improvements. Ask what they built and the mechanism by which it improved things, ask about the details, and give such claims a sniff test. I find that about 50% of the time, someone is waffling, exaggerating, or taking credit for something they were very peripherally associated with and that was actually someone else’s work (really check this with managers who were responsible for large or outsourced teams - often they just parcelled out work and don’t have a good grasp of what was going on).

Also, I find good data scientists get excited about describing the impact they have (because, after all, it’s validation of their work) as well as the interesting details about the process and mechanics they used to make things work. Ask about problems and what was interesting about their projects, and be wary of people who claim impact but then answer with “well, I wasn’t that closely associated with the details, but…”, are overly hand-wavy, or are obviously exaggerating. These are all sniff test fails. Good data scientists can tell the story around the result.

Data Understanding

One of the biggest distinctions I find between really great data scientists and people who talk a good game is that good data scientists have a drive to develop an innate understanding of their data. This is not trivial, as most data sets data scientists get, particularly if they’re not dedicated to a particular business unit or system, are dirty, messy things in organizations that have focused more on product features and where the emphasis on data quality is low.

Good data scientists are willing to spend time understanding the shape and the smell of the data they have access to, and are willing to roll up their sleeves and get dirty running their hands around in it and exploring it. Often hidden traps, strange measurements, weird quirks, or historical time bombs lie in wait for our unsuspecting future heroes.

Data scientists who ignore these things do so at their peril. They must be able to extract, manipulate, and derive viable and significant features from the data they have access to. In really exceptional cases, they also derive improvements to the way data should be recorded in order to allow greater use in the future, and even devise experiments or changes to the product in order to create the data needed to figure out their answers where it does not yet exist. In cases where direct data does not exist, they come up with clever proxies to get close approximations.

This is key. If your data scientists are just taking the data they get as a comfy, delivered package and running it through a model, you’re most likely doing it wrong. In fact, you’re probably leaving a lot of value on the table, if not creating models that are making bad predictions. And data scientists need to do this in an end-to-end fashion, as opposed to the “just model this” approach of some big, slow-moving, sclerotic types of companies. So, suffice it to say, data scientists need to be able to extract, manipulate, munge, and understand the composition and distribution of their data sets despite their size and complexity.

Not doing so leads to bad things happening, which you often see among well-meaning but overly ambitious software engineers, like misapplying algorithms meant for particular data distributions to datasets that don’t fit them. Classic examples are applying k-means clustering to non-globular data and torturing it for marketing segmentation, or not being sensitive to overfitting with decision trees in datasets with many outliers.
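The k-means case is easy to see on toy data. Below is a minimal, dependency-free sketch, assuming two perfectly clean concentric rings as the "non-globular" data; `ring` and `kmeans` are hypothetical helper names, and the hand-rolled Lloyd's loop is only a stand-in for whatever library implementation you would actually use.

```python
import math


def ring(radius, n):
    """n evenly spaced points on a circle around the origin (deterministic, no noise)."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]


def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    centroids = list(centroids)  # work on a copy of the starting centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return labels


# Two "segments" that are obvious to the eye: a small ring inside a big ring.
inner, outer = ring(1.0, 200), ring(5.0, 200)
points = inner + outer
truth = [0] * len(inner) + [1] * len(outer)

# Even seeding with one point from each ring does not rescue the result.
labels = kmeans(points, centroids=[inner[0], outer[0]])

# Best-case agreement with the true rings, allowing for swapped cluster ids.
matches = sum(t == lab for t, lab in zip(truth, labels))
agreement = max(matches, len(points) - matches) / len(points)
outer_clusters = set(labels[len(inner):])

print(f"agreement with true rings: {agreement:.2f}")
print(f"clusters containing outer-ring points: {sorted(outer_clusters)}")
```

With two centroids, k-means can only cut the plane with a straight boundary, so the outer ring ends up split between both clusters even though the two rings are trivially separable by eye. Someone who has looked at the shape of the data first would not reach for k-means here.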

Fin

These are the things that I have found separate the wheat from the chaff among people describing themselves as data scientists. Besides this, hire for hustle and attitude, and for people who are just intrinsically motivated and curious about looking at and answering things. Try to steer clear of people who are obviously just interested in the pay bumps they might receive at your company. Good data scientists are driven by how interesting their work is, the impact it has, and how much they will learn, so your job in paying them is making sure you remove the thought of money from their heads. If money is their overriding concern, or they seem to be more about exploiting the current market premium, I’d go for a Moneyball strategy in recruitment. Also, beware of people who have skipped around too much. Large claims of huge benefits from someone who spent a year at a place usually mean they were modelling but never actually had a data product, and generally show that there was either little return from what they did (since they certainly would be unable to demonstrate it) or that there is a management issue there.

They should be genuinely interested in the types of problems you and your company need to solve, and be asking questions about those in the interview. They need to buy into your mission and where you’re going. I’d worry less about whether they have practical experience with LSTMs or GANs or other impressive-sounding techniques, as the ugly truth of most data science is that most of your problems are solvable with much simpler algos. It’s only as you start moving in the direction of squeezing out diminishing returns on your models, and with lots (and lots) of highly accurate data, that these advanced approaches become relevant.

So, make sure they are smart, get shit done, and play well with others, follow the guidelines above, and hopefully you’ll be able to recruit and retain an amazing team. The value you can bring your organization is enormous, and it’s amazing helming a high-performing team that does great, impactful work, works together well, and feels good about itself. Good luck!

Let me know what you think about the post @awws or hola@wakatara.com. I’d love to hear feedback about what else has worked for you, or if you feel I’ve missed something (or am totally off base), with (reasoned) opinions on why I might be wrong and what would make this post better.