What is Big Data?

You see and hear the term Big Data everywhere, but what does it actually mean? Let's start with a simple definition. You can argue that the modern Big Data era was ushered in by the introduction of the MapReduce model in a famous white paper by Google®. The first sentence of the paper’s abstract provides a defining characteristic of Big Data:


MapReduce is a programming model and an associated implementation for processing and generating large data sets.

The important point from this simple definition is that Big Data is a framework rather than some pre-set dataset size. It is a framework that enables scalability. You should not be overly concerned with hard dataset sizes because the rate of progress in computational technology quickly renders dataset size benchmarks obsolete. The Apache Hadoop ecosystem, which originated from Google’s MapReduce model and is today synonymous with Big Data, is also a framework. The spirit of the definition also extends to the commercial cloud computing services offered in the marketplace.
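To make the "programming model" idea concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps applied to a word count. It is only an illustration: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own and are not the API of Hadoop or of Google's implementation, and a real framework would run the map and reduce calls on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document.

    Each document is handled on its own, so calls could run in parallel.
    """
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three steps sequentially (hypothetical driver, for illustration only)."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big", "data set"]))
```

Nothing in this sketch depends on how large the list of documents is, which is the sense in which the model is a framework rather than a dataset size.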

Although there is a lot of hype about Big Data and large data sets, mathematicians and statisticians have long studied related problems and have developed reasonably good theories and procedures for dealing with problems involving very large numbers of observations. Without wading into a deep discussion of the mathematical and statistical properties of estimators, I'll just say that there have been proofs of the Laws of Large Numbers, which are theoretical underpinnings of statistical estimation theory, dating back to the 18th century.

The great polymaths, first Bernoulli in the 18th century and later Chebyshev in the 19th, had a lot to say about "Big Data," which we'll define for the moment as data gathered from experiments that are repeated, independently, a very large number of times under identical conditions. Independence is a particularly important aspect of the MapReduce model when thought of in terms of parallel computation, because pieces of work that do not depend on one another can be processed separately and combined afterward. I'll also mention that there generally is a distinction between high performance computing (HPC) and the MapReduce model even though both make significant use of parallel computation. I say this not to start a deep technical discussion of the distinction between the two, but rather to highlight that frameworks providing functionality similar to MapReduce existed, primarily in the HPC/supercomputing world, long before the public introduction of MapReduce. What's not really debatable is that Big Data is now part of the popular lexicon, and I believe MapReduce deserves a lot of credit for that.
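The definition above, independent trials repeated many times under identical conditions, can be sketched with a small simulation. The code below is an illustration under my own assumptions, not anything from the MapReduce paper: the helper names are hypothetical, the trials are simulated coin flips, and giving each chunk its own seed is a simplification standing in for truly independent workers.

```python
import random


def chunk_summary(seed, n_trials, p=0.5):
    """Simulate n_trials independent Bernoulli(p) trials for one chunk.

    Returns (successes, trials), a summary small enough to pass to a combiner.
    """
    rng = random.Random(seed)  # each chunk gets its own generator
    successes = sum(1 for _ in range(n_trials) if rng.random() < p)
    return successes, n_trials


def combine(summaries):
    """Merge per-chunk summaries into one overall proportion."""
    total_successes = sum(s for s, _ in summaries)
    total_trials = sum(n for _, n in summaries)
    return total_successes / total_trials


def estimate(n_chunks, trials_per_chunk, p=0.5):
    """Process every chunk separately, then combine (sequential here for simplicity)."""
    summaries = [chunk_summary(seed, trials_per_chunk, p) for seed in range(n_chunks)]
    return combine(summaries)


if __name__ == "__main__":
    for n_chunks in (1, 10, 100):
        print(n_chunks * 1000, "trials:", estimate(n_chunks, 1000))
```

Because the chunks do not depend on one another, each call to `chunk_summary` could run on a different machine, and by the Law of Large Numbers the combined proportion should settle near the true value of 0.5 as the total number of trials grows.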

Advances in computational technology have allowed us to store and process ever larger amounts of data. An important question, however, is whether Big Data actually leads to better decision making or scientific analysis compared to the olden days of Small Data. This is an open question. Generally speaking, more data is better than less data. We can be more confident about things like point estimates with larger numbers of observations. That said, statistical analyses are often done to determine exactly what sample size is needed to reach a conclusion given a pre-set confidence level and margin of error. Sample sizes from these kinds of studies are typically numbers less than 1000, not hundreds of thousands or millions. Do you really need a million observations to tell you the same thing that a thousand observations do? How meaningful are commonly used statistics like p-values when the sample size n is in the millions or billions?
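Both questions can be made concrete with a little arithmetic. The sketch below assumes the textbook normal-approximation formula for estimating a proportion, n = z^2 p(1 - p) / E^2, and a two-sided z-test with known unit variance. The function names and the example numbers (a 5% margin of error, an effect of 0.01 standard deviations) are my own choices for illustration.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin_of_error, confidence=0.95, p=0.5):
    """Smallest n so a proportion estimate is within margin_of_error at the
    given confidence level (normal approximation; p=0.5 is the worst case)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)


def two_sided_p_value(effect_in_sd, n):
    """P-value of a two-sided z-test when the observed mean differs from the
    null value by effect_in_sd standard deviations with n observations."""
    z = abs(effect_in_sd) * math.sqrt(n)
    # erfc avoids the loss of precision of 1 - cdf(z) for large z
    return math.erfc(z / math.sqrt(2))


if __name__ == "__main__":
    print("n for a 5% margin at 95% confidence:", sample_size_for_proportion(0.05))
    for n in (1_000, 1_000_000):
        print(f"n={n}: p-value for a 0.01 SD effect = {two_sided_p_value(0.01, n):.3g}")
```

Under these assumptions a few hundred observations are enough for a 5% margin of error, while the same negligible effect that is nowhere near significant at n = 1,000 becomes overwhelmingly "significant" at n = 1,000,000. The effect has not become more important; the test has simply become sensitive enough to detect it.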

I am aware of some of the other Big Data definitions out there. To narrow the scope of this blog post, the definition provided here is intentionally tied more closely to statistical theory and methods than some of the other definitions. Why? Because I believe statistical reasoning is the best way of judging the value or usefulness of results derived from Big Data analyses. Future blog posts will cover some of the other commonly cited components of a Big Data definition like velocity and complexity. The MapReduce paper above does an excellent job of introducing and describing a framework for processing and generating very large data sets. The question then is, do statisticians, both applied (anyone who uses statistical methods for their work) and pure, view this as an interesting and exciting development, or do they view Big Data as the latest reincarnation of an issue that was settled long ago? I found this paper helpful when I first started looking into these issues.