We call ourselves a data science consulting company. But what exactly is data science? This question turns out to be more multifaceted than it might first appear. The easy answer is to read off the two terms and associate each with some long-held definition. Most people think of data as a collection of information to be processed in some way. And most people think of science as a systematized body of knowledge, or maybe a process for attaining that systematized knowledge. Both definitions give a good starting point for thinking about data science.
To get a better idea of why so many people are talking about data science, consider this often-cited report from the McKinsey Global Institute, which projects that, by 2018, the United States faces a shortage of up to 190,000 people with the "deep analytical skills" needed to handle the data being generated in the current Big Data era. The same report also states that there will be a shortage of 1.5 million managers and analysts to do related work. You don't have to be an economist to conclude that labor shortages like these lead to higher salaries, which in turn lead to greater general interest in a particular labor market.
Many people argue that the term data science is just a marketing ploy to rebrand the age-old field of statistics. This argument is not completely incorrect: data science is very much about statistics, and the fact that people are excited about and talking about the possibilities and applications of statistics shouldn't be viewed in a negative light. But I would say that the primary reason the marketing-ploy argument is made is to suggest, or openly state, that the current phenomenon is something a particular discipline has always done. I believe these arguments miss the mark for reasons I'll state below. And I don't say this to single out the field of statistics, because data science, as we'll see, is highly interdisciplinary, and there are many fields that could, and probably do, claim that "we've always done this."
Here’s a commonly used definition of data science:
Data science involves the extraction of insightful information from messy data.
This simple definition brings into focus where I believe the "data science = statistics" argument misses the mark. Data science is much more concerned with the data pre-processing step of data analysis. This is the step where the analyst first has to devise a plan to deal with, in what I think are the most interesting cases, complex and unstructured data. Further, data science is much more concerned with automating the pre-processing step, and other aspects of computational implementation, than has historically been the case in fields like statistics and economics. Why is automation important, or even desirable?
As our ability to create and store larger data sets grows, it becomes difficult or impossible to process them manually. A human simply can't scan every word in millions of documents or files in a time-effective way. Computers can do these tasks very easily, and programs can be written to automate tedious pre-processing tasks. Computers are also very good at helping us organize, search, model, and simulate network structures and relationships in an automated way. Network structures can be physical, like an electrical circuit, or social, like one's network of friends. These automated processes are guided by algorithms, which are a central concern of computer science. Once the data is pre-processed, traditional methods from statistics, like parameter estimation and, more generally, empirical analysis, come back into play. Future blogs will cover how traditional statistical methods compare with the "newer" methodology of machine learning.
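As a small illustration of the kind of automation described above (a hypothetical sketch, not from the original post), a few lines of Python can tokenize and count words across an arbitrary number of documents, a pre-processing task no human could do at scale:

```python
from collections import Counter

def word_counts(documents):
    """Count word occurrences across a collection of raw text documents.

    A tedious pre-processing task for a human; trivial to automate.
    """
    counts = Counter()
    for doc in documents:
        # Lowercase and split on whitespace -- a deliberately simple tokenizer.
        counts.update(doc.lower().split())
    return counts

# A toy "corpus"; in practice this could be millions of files read from disk.
corpus = ["Data science involves messy data", "Messy data needs pre-processing"]
print(word_counts(corpus).most_common(2))  # [('data', 3), ('messy', 2)]
```

The same loop works unchanged whether the corpus holds two strings or two million files, which is precisely the point about automation.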
How did the shortage mentioned above come about? After all, computer scientists and engineers are very good at building automated systems, and statisticians, econometricians, and psychologists are very good at building models. I believe one culprit is discussed in articles like this that talk about the notion of stove-piping, which plagues institutions from academia to government. In short, stove-piping describes situations in which institutions and people place walls around themselves, cutting off the free flow of ideas and information with neighboring institutions and people.
Given the economic stimulus mentioned above, universities across the country are scrambling to stand up data science and data analytics programs and departments. This is understandable, but is this really a good thing? Do we really need these new programs that some argue are “watered-down” versions of their underlying disciplines? Does the trend towards “interdisciplinarity,” particularly in academia, produce well-rounded scientists and analysts, or does the trend actually hinder the goal of producing people with the deep analytical skills that come with learning one of the traditional disciplines well? These are definitely open questions.
I’ll close with AD&T’s view of data science, which has three components. The first two components, statistics and computer science, are well known and described above. I’ll add a third component: the arts. As one of the commenters on this blog puts it, “art is the science of expression.” Useful data tells a story that can be visualized in creative and innovative ways. So it is with data science: there are intangible aspects that are, almost by definition, difficult to pin down, but that are very important for decision making in the era of Big Data. It’s not always obvious what a term like “insightful” or “interesting” or “creative” means, but you know it when you see it. Long after, as some suggest, advances in automated speech recognition technology render many of our current programming languages and frameworks obsolete, we’ll still wrestle with the intangible and metaphysical aspects of knowledge discovery that elude formulas. But wait, there’s a formula for that too.
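The formula itself appears to have been an image that did not survive; a reconstruction consistent with the three components named above, weighted by the omega variables mentioned below, might look like:

```latex
\[
\text{Data Science} \;=\; \omega_{1}\,\text{Statistics}
  \;+\; \omega_{2}\,\text{Computer Science}
  \;+\; \omega_{3}\,\text{Arts}
\]
```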
where the omega variables can be thought of as weighting factors.