GARY SMITH: The effect of bad data on good science
Coffee was very popular in Sweden in the 18th century – and also illegal. King Gustav III believed it was a slow poison and devised a clever experiment to prove it.
He commuted the death sentences of a pair of twin brothers convicted of murder, who were awaiting beheading, on one condition: one brother had to drink three pots of coffee a day while the other drank three pots of tea. The premature death of the coffee drinker would prove that coffee was poisonous.
It turned out that the coffee-drinking twin outlived the tea-drinker, but it wasn’t until the 1820s that Swedes were finally legally allowed to do what they had always done: drink coffee, lots of coffee.
The cornerstone of the scientific revolution is the insistence that claims be tested with data, ideally in a randomized controlled trial. Gustav’s experiment was notable for its use of identical male twins, which eliminated the confounding effects of sex, age, and genes. The most glaring weakness was that nothing statistically convincing can come from such a small sample.
The problem today is not a scarcity of data, but the opposite: we have so much data that it undermines the credibility of science.
Luck is inherent in randomized trials. In a medical study, some patients may be healthier. In an agricultural study, some soils may be more fertile. In an educational study, some students may be more motivated. Researchers therefore calculate the probability (the p-value) that results at least as extreme as those observed could occur by chance. A low p-value indicates that the results cannot easily be attributed to the luck of the draw.
How low? In the 1920s, the great British statistician Ronald Fisher said he considered p-values below 5% compelling, and 5% became the threshold for the “statistically significant” certification needed for publication, funding and fame.
It is not a difficult hurdle to clear. Suppose a hapless researcher calculates correlations among hundreds of variables, blissfully unaware that the data are actually random numbers. On average, one out of every 20 correlations will be statistically significant, even though every correlation is nothing more than coincidence.
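That arithmetic can be checked with a short simulation (a sketch, not anything from the article itself: the sample size of 25, the 1,000 trials, and the t-table critical value 2.069 for 23 degrees of freedom are my own illustrative choices). Correlating pairs of pure noise series and testing each at the 5% level flags roughly one in 20 as "significant":

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)
n = 25          # observations per variable (illustrative choice)
trials = 1000   # independent pairs of random variables
T_CRIT = 2.069  # two-tailed 5% critical value, t-distribution, n - 2 = 23 df

significant = 0
for _ in range(trials):
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [random.gauss(0, 1) for _ in range(n)]
    r = pearson_r(x, y)
    # Standard t statistic for testing whether a correlation is zero
    t = r * math.sqrt((n - 2) / (1 - r * r))
    if abs(t) > T_CRIT:
        significant += 1

print(f"{significant / trials:.1%} of pure-noise correlations are 'significant'")
```

By construction, none of these correlations means anything, yet about 5% of them clear Fisher's bar.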
Real researchers don’t correlate random numbers, but too often they correlate what are essentially randomly chosen variables. This aimless search for statistical significance even has a name: data mining. As with random numbers, the correlation between randomly chosen, unrelated variables has a 5% chance of being fortuitously statistically significant. Data mining can be compounded by manipulating, pruning, and otherwise torturing the data to obtain low p-values.
To find statistical significance, you just have to look hard enough. The 5% hurdle has had the perverse effect of encouraging researchers to do more tests and report more meaningless results.
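Looking hard enough is easy to quantify. A minimal sketch (again with my own illustrative numbers, not the article's): with 100 unrelated candidate predictors each having a 5% chance of spurious significance, the chance that none clears the bar is about 0.95^100, under 1%, so the search almost always "succeeds":

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)
n = 25            # observations (illustrative choice)
candidates = 100  # unrelated variables to search through

# One "outcome" series and 100 candidate predictors, all pure noise.
target = [random.gauss(0, 1) for _ in range(n)]
correlations = [
    abs(pearson_r(target, [random.gauss(0, 1) for _ in range(n)]))
    for _ in range(candidates)
]
best = max(correlations)
# For n = 25, |r| > 0.396 is "significant" at the 5% level, so the best
# of 100 noise correlations almost always clears the bar.
print(f"best |r| among {candidates} noise predictors: {best:.2f}")
```

The reported "best" correlation looks impressive precisely because it was selected after the fact, which is the data miner's trick.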
Thus, silly relationships get published in good journals simply because the results are statistically significant. For example:
• Students perform better on a recall test if they study for the test after taking it (Journal of Personality and Social Psychology).
• Japanese Americans are prone to heart attacks on the fourth day of the month (British Medical Journal).
• Bitcoin prices can be predicted from stock returns in the cardboard, container and box industry (National Bureau of Economic Research).
• Elderly Chinese women can postpone their death until after the Harvest Moon Festival celebration (Journal of the American Medical Association).
• Women who eat breakfast cereals daily are more likely to have male babies (Proceedings of the Royal Society).
• People can use power poses to increase their testosterone, the dominance hormone, and lower their cortisol, the stress hormone (Psychological Science).
• Hurricanes are deadlier if they have female names (Proceedings of the National Academy of Sciences).
• Investors can earn a 23% annual return in the market by basing their buy/sell decisions on the number of Google searches for the word “debt” (Scientific Reports).
These now discredited studies are just the tip of a statistical iceberg known as the replication crisis.
A team led by John Ioannidis reviewed attempts to replicate 34 highly respected medical studies and found that only 20 had been confirmed. The Reproducibility Project attempted to replicate 97 studies published in leading psychology journals and confirmed only 35. The Experimental Economics Replication Project attempted to replicate 18 experimental studies published in prominent economics journals and confirmed only 11.
I wrote a satirical article that aimed to demonstrate the craziness of data mining. I looked at Donald Trump’s voluminous tweets and found statistically significant correlations between Trump tweeting the word “president” and the S&P 500 index two days later; Trump tweeting the word “never” and the temperature in Moscow four days later; Trump tweeting the word “more” and the price of tea in China four days later; and Trump tweeting the word “democrat” and some random numbers I had generated.
I concluded – with my tongue planted as firmly in cheek as possible – that I had found “compelling evidence of the value of using data mining algorithms to uncover statistically compelling, hitherto unknown correlations that can be used to make reliable predictions”.
I naively assumed that readers would understand the point of this nerdy joke: large datasets can easily be mined and tortured to identify patterns that are utterly useless. I submitted the article to an academic journal, and the reviewer’s comments beautifully demonstrate how deeply ingrained is the notion that statistical significance trumps common sense: “The article is generally well-written and structured. This is an interesting study, and the authors collected unique datasets using state-of-the-art methodology.”
It is tempting to believe that more data means more knowledge. However, the explosion in the number of things being measured and recorded has magnified beyond belief the number of coincidental patterns and false statistical relationships waiting to fool us.
Gary Smith is a professor of economics at Pomona College.