Why all scientists are not data scientists

John Hawkins
Oct 14, 2022

Preface: This blog post was originally published in 2017 when these memes were flowing thick and fast. It is still relevant today given the growth of the field and the continuing uncertainty about what data science means.

There is a meme you will see floating around the internet that comes in many forms; one version is shown in the header image above. It is part of the vague internet resistance to this new occupation. The response is somewhat justified: Data Scientist is a job title that requires no specific qualification, and opinions differ on what the core skill set is.

In spite of the fuzzy definition of the job, there are good reasons that this new occupation exists. To understand those reasons we first need to clear up the misunderstanding that lies at the core of this meme.

To someone who has never worked in science it might seem that everything that goes into a piece of scientific research belongs solely to the scientist or team of scientists producing it. For example, all of the content in a molecular biology paper on DNA and protein interaction has been produced by the diligent molecular biologists who did the experiments, analysed the results and published the findings.

However, lots of science requires the use of statistics in order to draw conclusions. Typically, not all scientists know everything they need to about statistics in order to run the correct analysis. In larger teams there will be someone who has specialised in this kind of work. In other situations the scientists involved will consult an external statistician to make sure that the analysis is valid.

Those scientists and statisticians who have focused on understanding the limitations and possibilities of making inferences from experimental data are the forerunners of data scientists. They have a skill which transcends the particulars of what it takes to do lab work on cell cultures, field studies for ecology and so on. Their core skill involves thinking about the data at an abstracted level, asking the question “given data with these properties, what conclusions can we draw?”

At this point you might well be thinking something like this:

“Ok, sure, but then aren’t these data scientists just statisticians?”

You could certainly find some statisticians who will agree with you on this.

It is undeniable that statistics is the field that historically covered the domain of data science. Unfortunately for statisticians, computer scientists started making inroads into their domain when they developed the field of machine learning. While it is true that some of this involved re-inventing existing statistical ideas, it is also true that the emphasis of machine learning was different from that of statistics. Traditionally, statisticians cared more about the explanatory power of the variables in their models, and hence they also cared a great deal about the goodness of fit of the model to the training data. Machine learning computer scientists, by contrast, cared only about how well the model predicted the future. This, of course, is an over-simplification, but the difference was enough to push machine learning into the lead when it came to building predictive models for certain problems.
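To make the distinction concrete, here is a toy sketch that is not part of the original argument. It assumes synthetic data (a straight line plus noise) and a 1-nearest-neighbour predictor, both chosen purely for illustration. The model fits its training data perfectly, yet that perfect fit says little about how well it predicts points it has not seen.

```python
import random


def one_nn_predict(train, x):
    """Predict y for x by copying the y of the closest training point."""
    _, nearest_y = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_y


def mse(train, points):
    """Mean squared error of the 1-NN predictor on the given points."""
    errors = [(one_nn_predict(train, x) - y) ** 2 for x, y in points]
    return sum(errors) / len(errors)


# Hypothetical data: y = 2x plus Gaussian noise (an assumption for this sketch).
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(0.0, 10.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))

train, held_out = data[:150], data[150:]

train_mse = mse(train, train)        # goodness of fit to the training data
held_out_mse = mse(train, held_out)  # error when predicting unseen points

print(f"training MSE: {train_mse:.3f}")
print(f"held-out MSE: {held_out_mse:.3f}")
```

The training error is zero because each training point is its own nearest neighbour, while the held-out error is not. Judging the model on the second number rather than the first is roughly the shift in emphasis described above.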

You don’t need to take my word for it. The statisticians Trevor Hastie and Rob Tibshirani, in this video introducing their Statistical Learning course, admit as much. At time point 12:00 they describe the difficulty they had beating neural network models on the task of handwritten digit recognition. After first thinking it was easy, they had to concede the task was very hard and their statistical toolkit was not up to it. Their work on problems from machine learning led them to contribute to the developing field of statistical learning (a hybrid of statistics and machine learning), along with brilliant pioneers like Leo Breiman.

So data science is something of a compromise between the tools of statistics and computer science. It brings in experimental design, inference and testing from the former, and a great number of practical ideas about data mining and machine learning from the latter. In addition, data science typically involves some experience with the way business and computing systems operate. The vast majority of data science work involves contributing to building some kind of system, which requires an understanding of the computational feasibility, latency and timing of all the parts involved.

Now you should see why the occupation of data scientist is not one that just any scientist can do (without first learning a bunch of new tricks). This should also explain why universities are struggling to catch up with adequate training programs. Data science requires significant skill in at least three domains: statistics, computer science and business systems architecture.

Data science requires, above all else, people who think about data, inference and prediction at a level abstracted from the messy details. This allows them to work across many domains. At the same time, data scientists need to understand how the messy details tend to affect what can be done to produce results for a business.

Originally published at https://www.linkedin.com.

Moving to Medium to consolidate blog posts
