Big Data’s Causation and Correlation Issue


If Big Data came in a box, it would have stamped on the side: “Warning: Correlation Does Not Imply Causation.” There’s a common thread among Big Data stories, often told as exciting tales of wonder, that correlation somehow approximates causation. It sometimes gets expressed in oblique arguments that more data is better and in stories of the search for the perfect algorithm.

It isn’t that simple, as we wrote a while back: In some cases, like when choosing wine, small data actually matters far more than big. It can come down simply to whether the wine buyer likes wine with heavy tannins or not…so much for bouquet, texture, and fruit.

Causation Versus Correlation

If you’re new to this area, I should explain that causality means A causes B; correlation, on the other hand, means that A and B tend to be observed at the same time. These are very, very different things when it comes to Big Data, but often the difference gets glossed over or ignored. Whether correlation is “good enough” to act without knowing the cause for something depends entirely on the problem being solved and the risks of being wrong.

Is Correlation Good Enough? It Depends…

Gil Press, writing in Forbes, explains this idea very well in his review of the recently published, widely discussed book Big Data: A Revolution that Will Transform How We Live, Work, and Think:

“For many everyday needs, knowing what not why is good enough.” The book is full of such examples from making better diagnostic decisions when caring for premature babies to which flavor Pop-Tarts to stock at the front of the Walmart store before a hurricane. Big data can help answer these questions, but they never required “knowing why.” Big data analysis can be about correlations OR causation—it all depends, as it has always been, on what question we are asking, what problem we are solving, and what goal we are trying to achieve.”

Gil isn’t the only one making this distinction. Algorithms by themselves don’t tell you what data means, and without human input or direction (in the form of hypotheses or data discrimination) they can actually steer understanding of data in the wrong direction. Data science, it seems, requires a healthy sense of skepticism.

This is exactly what makes data scientists so hard to find—it isn’t about the “big-ness” of data or the algorithm’s perfection. It is about knowing a great deal about the data so that the true meaning can be coaxed out, not squeezed out by a mindless process.

It will disappoint many to hear that there isn’t always an expensive, industrial solution to ever larger amounts of data. Instead, it often comes down to having great governance of data so that metadata (data about data) can be fully understood and taken into account.

Examples of Correlation Versus Causation

Getting it wrong can be expensive, as shown by a Freakonomics example of mistaking correlation for causation. The State of Illinois almost sent books to every child in the state because studies showed that books in the home correlated with higher test scores. Later studies showed that children from homes with many books did better even if they never read them, leading researchers to correct their assumptions: homes where parents buy books tend to be environments where learning is encouraged and rewarded. This is correlation versus causation in plain view. Illinois didn’t have money to waste going in the wrong direction, and neither does today’s enterprise.
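For readers who like to see this in numbers, here is a minimal sketch of the same pattern using purely synthetic data. Everything in it is an assumption made up for illustration: the hidden "engagement" variable, the weights 2.0 and 3.0, and the helper names simulate, pearson, and residuals. None of it comes from the Illinois studies. In this toy world, books never influence scores directly, yet the two still end up strongly correlated because both depend on the hidden factor.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing its straight-line fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return [y - mean_y - slope * (x - mean_x) for x, y in zip(xs, ys)]


def simulate(n=5000, seed=0):
    """Generate synthetic homes (hypothetical model, not real data)."""
    rng = random.Random(seed)
    books, scores, engagement = [], [], []
    for _ in range(n):
        e = rng.gauss(0, 1)            # hidden factor: how much learning is encouraged
        b = 2.0 * e + rng.gauss(0, 1)  # books in the home depend on the hidden factor
        s = 3.0 * e + rng.gauss(0, 1)  # scores depend on the hidden factor, NOT on books
        engagement.append(e)
        books.append(b)
        scores.append(s)
    return books, scores, engagement


if __name__ == "__main__":
    books, scores, engagement = simulate()
    raw = pearson(books, scores)
    # Correlation that remains once the hidden factor is accounted for.
    adjusted = pearson(residuals(books, engagement), residuals(scores, engagement))
    print(f"books vs. scores, raw:      {raw:.2f}")
    print(f"books vs. scores, adjusted: {adjusted:.2f}")
```

The raw correlation comes out strong, while the adjusted one should sit close to zero. In real data the hidden factor is rarely measured so neatly, which is exactly why knowing the data matters more than the size of it.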

A Simple Explanation

Khan Academy, probably the best broad-based learning site on the Internet, has a great video lesson showing the difference between correlation and causation. It is a great reminder of the limitations of data.

This story was co-authored by Jeanne Roué-Taylor.