Oh boy - look what a data hunter has dragged in this time! Why is this problem so common? And who on earth is Bonferroni?
Our friend here found one "statistically significant" result when he looked at goodness knows how many differences between groups of people. He's fallen totally for a statistical illusion that's a hazard of 'multiple testing'. And a lot of headline writers and readers will fall for it, too.
Then he's made it worse by taking his unproven hypothesis (that a particular drink on a particular day in a particular group of people prevented stroke) and whacking on another unproven hypothesis (that if everyone else drinks lots of it, benefits will ensue). But it's the problem of multiple testing (also called multiplicity) where Bonferroni comes in.
It's pretty much inevitable that multiple testing will churn out some wrong answers - something that the Italian mathematician Carlo Bonferroni (1892-1960) figured out how to analyze.
A "statistically significant" difference between groups of people means that, if there were really no difference at all, chance alone would throw up a difference at least that big less than 5 times out of 100 (a "p" value of less than 0.05). That makes a fluke an unlikely explanation. It doesn't mean there's a 95% probability that the finding is right, though, or that other similar groups of people in similar circumstances would experience roughly the same difference 95 times out of 100.
If you test for multiple possibilities, you need to expect flukes: even when nothing real is going on, about 5 out of every 100 tests (or 1 in 20) will come up "statistically significant" by chance alone. If you test only a few things, your chances of this kind of random error are fairly low.
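To watch those flukes turn up, here's a rough sketch in Python. It leans on an assumption: when there's truly no difference, a test's p value is equally likely to land anywhere between 0 and 1, so a random number can stand in for it. The function name and the seed are made up for illustration.

```python
import random

def count_flukes(num_tests, alpha=0.05, seed=1):
    """Count tests that look 'significant' when nothing real is going on."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # Assumption: with no true difference, each p value is uniform on [0, 1).
    p_values = [rng.random() for _ in range(num_tests)]
    return sum(p < alpha for p in p_values)

flukes = count_flukes(100_000)
print(f"{flukes} of 100000 tests came up 'significant' by chance alone")
```

The count should hover around 5,000, which is the 5-in-100 rate at work.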
But especially if you have a big dataset, the more things you look at, the higher the chance is that you'll drag total flukes out. With high-powered computers crunching big data, this becomes a big problem - large numbers of spurious findings that can't be replicated.
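How fast does the risk climb? A minimal sketch, assuming the tests are independent of each other and that there's no real effect anywhere (real datasets are rarely that tidy): the chance of at least one fluke is 1 minus the chance that every single test stays quiet. The function name is just for illustration.

```python
def chance_of_at_least_one_fluke(num_tests, alpha=0.05):
    """Probability that at least one of num_tests comes up 'significant'
    by chance, assuming independent tests and no real effects."""
    return 1 - (1 - alpha) ** num_tests

for n in (1, 5, 20, 100):
    print(f"{n:>3} tests: {chance_of_at_least_one_fluke(n):.1%}")
```

Under those assumptions, 20 tests already carry roughly a 64% chance of at least one spurious "finding".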
Bonferroni's name graces some statistical techniques used to interpret results when doing multiple tests. There are others. Some are concerned that techniques based on Bonferroni are too conservative - too likely to throw the baby out with the bathwater, if you like. So they use tests that have a different basis, such as the False Discovery Rate (FDR).
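To make the contrast concrete, here's a hedged sketch of two common approaches: the classic Bonferroni correction (divide the 0.05 threshold by the number of tests) and the Benjamini-Hochberg procedure, one widely used way of controlling the FDR. The p values below are invented purely for illustration, and the function names are my own.

```python
def bonferroni(p_values, alpha=0.05):
    """Keep a finding only if its p value beats alpha divided by the number of tests."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for controlling the FDR."""
    m = len(p_values)
    # Rank the tests from smallest p value to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p value is at most (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Keep every test ranked at or below that cut-off.
    keep = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            keep[idx] = True
    return keep

# Made-up p values from ten hypothetical comparisons.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print("Uncorrected:       ", sum(p < 0.05 for p in p_values))
print("Bonferroni:        ", sum(bonferroni(p_values)))
print("Benjamini-Hochberg:", sum(benjamini_hochberg(p_values)))
```

With these invented numbers, five results pass the uncorrected 0.05 bar, Bonferroni keeps one, and Benjamini-Hochberg keeps two: a small taste of why some people find Bonferroni too strict.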
Statistical tests can't totally eliminate the chance of random error, though. So you usually need more than just a single possibly random test result to be sure about something.
If you're interested in how to communicate statistics accurately and well, check out Session 2G at Science Online this week: Evelyn Lamb and I are co-moderating. Follow on Twitter with #PublicStats (#Scio13).
Getting more technical...
What about multiplicity issues in systematic reviews? As the Cochrane Handbook (section 16.7.2) points out, systematic reviews concentrate on estimating pre-specified effects - not searching for possible effects. Safeguards still matter, though. Even pre-specified analyses need to be kept to a minimum. And how many analyses were done needs to be kept in mind when interpreting results.
If you would like to read more technical information about multiple testing, here are some free slides from the University of Washington. And if you want to read more about the controversies and issues, here's a primer in Nature and an article in the Journal of Clinical Epidemiology (behind paywalls).