P-hacking is data diving, data fishing, data mining, or any other term (dredging, snooping, etc.) that describes slicing and re-analyzing the data from a study in enough different ways that SOMETHING significant eventually emerges. It doesn’t require manipulation in the sense of making up, fudging or changing any of your numbers or other data. It simply means a researcher plays around with the numbers enough to eventually uncover an association, based purely on the statistical likelihood that if you look hard enough for something in enough different places, you’ll eventually find something. As one study described it, “researchers collect or select data or statistical analyses until nonsignificant results become significant.” Or, more colloquially, “if you torture the data long enough, it will confess.”
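To make that concrete, here is a minimal simulation sketch (assuming Python with NumPy and SciPy; the "treatment" label and subgroup splits are entirely made up for illustration). The outcome is pure noise with no real effect anywhere, yet checking enough arbitrary subgroups will usually turn up at least one "significant" p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
outcome = rng.normal(size=n)             # pure noise: no real effect anywhere
treated = rng.integers(0, 2, size=n)     # random, meaningless "treatment" label
# 20 made-up subgroup splits (age bands, sex, region, etc. would work the same way)
subgroups = {f"subgroup_{k}": rng.integers(0, 2, size=n) for k in range(20)}

for name, flag in subgroups.items():
    a = outcome[(treated == 1) & (flag == 1)]
    b = outcome[(treated == 0) & (flag == 1)]
    p = stats.ttest_ind(a, b).pvalue
    if p < 0.05:
        print(f"'Significant' difference in {name}: p = {p:.3f}")

# With 20 separate looks at pure noise, roughly 1 in 20 will cross p < 0.05
# by chance, so at least one "finding" turns up in most runs.
```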
Deeper dive
One of the best explanations of P-hacking is in Christie Aschwanden’s award-winning piece Science Isn’t Broken, which we’ve previously written about. The ability to P-hack rests on the fact that any fixed threshold for statistical significance virtually guarantees coming up with SOMETHING significant if a study has enough variables and/or outcome possibilities. A P value represents how often an experiment or study would produce a result at least as striking as the one observed purely by chance, if there were no real effect. Since the arbitrary cutoff for significance is generally accepted as a P value of 0.05, a result that clears that bar is one you’d expect to see by chance alone about 5% of the time even when nothing real is going on. So imagine that you’re looking at 12 different possible outcomes and how they’re associated with 8 different variables. That’s 96 possible combinations that could be tested. Statistically speaking, that’s not much different from running a study nearly 100 times: about 5 of those combinations would probably show an association based on chance alone.
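That arithmetic is easy to check by simulation. The sketch below (again assuming Python with NumPy and SciPy, with the 12 outcomes and 8 variables being the hypothetical numbers from the text rather than real study data) correlates pure-noise outcomes against pure-noise variables and counts how many of the 96 comparisons come out "significant" at p < 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_subjects, n_outcomes, n_variables = 100, 12, 8

# Pure noise: no outcome is truly associated with any variable
outcomes = rng.normal(size=(n_subjects, n_outcomes))
variables = rng.normal(size=(n_subjects, n_variables))

false_positives = 0
for i in range(n_outcomes):
    for j in range(n_variables):
        _, p = stats.pearsonr(outcomes[:, i], variables[:, j])
        if p < 0.05:
            false_positives += 1

print(f"{false_positives} of {n_outcomes * n_variables} comparisons were "
      f"'significant' at p < 0.05 despite no real associations")

# Expected count: 96 * 0.05 = 4.8, i.e. about 5 chance "findings"
```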
To watch out for P-hacking, check how many outcomes the researchers measured, how many variables they considered, and how many subgroups or subanalyses they ran. The more of these there are, the more likely they’ll “find” something by chance alone, and the more skeptical you should be that the finding is “real.”