Statistically Funny: Nervously approaching significance

Saturday, March 9, 2013

We're deluged with claims that we should do this, that or the other thing because some study has a "statistically significant" result. But don't let this particular use of the word "significant" trip you up: when it's paired with "statistically", it doesn't mean it's necessarily important. Nor is it a magic number that means that something has been proven to work (or not to work).

The p-value on its own really tells you very little. It is one way of trying to tell whether the result is more or less likely to be "signal" than "noise". If a study sample is very small, only a big difference might reach statistical significance, while a much smaller difference can reach it in a bigger study.

But statistical significance is not a way to prove the "truth" of a claim or hypothesis. What's more, you don't even need the p-value, because other measures tell you everything the p-value can tell you, and more useful things besides.

This is roughly how the statistical test behind the p-value works. The test is based on the assumption that what the study is looking for is not true - but instead, that the "null hypothesis" is true. The statistical test estimates how likely you would be to get the result you got, or one further away from "null" than that result, if the null hypothesis is true (that is, if the hypothesis the study is testing isn't true).
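One way to see that logic in action is a small exact permutation test. This is only a sketch of the general idea, not the test any particular study used: the two groups and their sleep figures are made up, and `permutation_p_value` is a hypothetical helper name. If the null hypothesis is true, the group labels are interchangeable, so the code tries every possible relabelling and counts how often the difference is at least as far from "null" as the one observed.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(group_b)
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))

    as_extreme = 0
    n_splits = 0
    # Under the null hypothesis every way of splitting the pooled data
    # into two groups of these sizes is equally likely.
    for idx in combinations(range(len(pooled)), n_a):
        a_sum = sum(pooled[i] for i in idx)
        diff = abs(a_sum / n_a - (total - a_sum) / n_b)
        n_splits += 1
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / n_splits


# Made-up hours of sleep per night for two hypothetical groups.
treated = [7.9, 8.1, 8.4, 8.6, 8.8]
control = [6.9, 7.0, 7.2, 7.4, 7.5]

# Only 2 of the 252 possible splits are this extreme: 2/252, about 0.008.
print(permutation_p_value(treated, control))
```

A small p-value here only says that a result like this would be unusual if the labels made no difference. It does not say how likely the study's hypothesis is to be true.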

If the p-value is <0.05 (less than 5%), then a result at least that far from "null" would turn up less than 5% of the time if the null hypothesis were true. So the result is compatible with what you would get if the hypothesis actually is true. But it doesn't prove it is true. You can't conclude too much based on that alone. The threshold of 0.05 for statistical significance means the confidence level for the test has been set at 95% (a significance level of 5%). That is common practice, but still a bit arbitrary.
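To make the threshold a little more concrete, here is a rough sketch that assumes a two-sided test whose statistic follows a standard normal distribution when the null hypothesis is true. The function names are hypothetical, and the 100 studies are an invented example. It shows how far from "null" a result has to be for a given threshold, and how many "significant" results you would expect from studies where there is truly nothing to find.

```python
from statistics import NormalDist


def critical_z(alpha):
    """Cut-off (in standard errors) for a two-sided test at level alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def expected_false_positives(alpha, n_null_studies):
    """Expected count of 'significant' results when every null is true."""
    return alpha * n_null_studies


for alpha in (0.05, 0.01, 0.001):
    print(alpha, round(critical_z(alpha), 2), expected_false_positives(alpha, 100))
```

Moving the threshold changes both numbers together, which is one way of seeing why any single cut-off is a convention rather than a law of nature.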

You can read more about statistical significance over here in my blog, Absolutely Maybe - and in Data Bingo! Oh no! and Does it work? here at Statistically Funny.

Always keep in mind that a statistically significant result is not necessarily significant in the sense of "important". It's "significant" only in the sense of signifying something. A sliver of a difference could reach statistical significance if a study is big enough. For example, if one group of people sleeps a tiny bit longer on average a night than another group of people, that could be statistically significant. But it wouldn't be enough for one group of people to feel more rested than the other.
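The sleep example can be sketched with some invented numbers. Assume, purely for illustration, that one group sleeps 2 minutes longer a night on average, that sleep time varies with a standard deviation of 60 minutes in both groups, and that a simple two-sample z-test with a known standard deviation is good enough. `two_sided_p` is a hypothetical helper, not a function from any library.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(diff, sd, n_per_group):
    """Two-sided p-value for a difference in means (z-test, known sd)."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


DIFF_MINUTES = 2   # assumed average difference in sleep
SD_MINUTES = 60    # assumed spread of sleep time within each group

# The difference never changes; only the size of the study does.
for n in (100, 10_000, 100_000):
    print(n, two_sided_p(DIFF_MINUTES, SD_MINUTES, n))
```

With 100 people per group the same 2 minutes is nowhere near the 0.05 threshold, while with 10,000 or more per group it falls below it. The difference is just as small in every case.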

This is why people will often say something was statistically significant, but clinically unimportant, or not clinically significant. Clinical significance is a value judgment, often implying a difference that would change the decision that a clinician or patient would make. Others speak of a minimal clinically important difference (MCID or MID). That can mean they are talking about the minimum difference a patient could detect - but there is a lot of confusion around these terms.

Researchers and medical journals are more likely to trumpet "statistically significant" trial results to get attention from doctors and journalists, for example. Those medical journal articles are a key part of marketing pharmaceuticals, too. Selling copies of articles to drug companies is a major part of the business of many (but not all) medical journals.

And while I'm on the subject of medical journals, I need to declare my own relationship with one I've long admired: PLOS Medicine - an international open access journal. As well as being proud to have published there, I'm delighted to have recently joined their Editorial Board.

(This post was revised following Bruce Scott's comment below.)

9 comments:

Bruce Scott, April 9, 2013 at 7:28 PM
Yikes.

When you say: "The probability that it is a fluke is less than 5% (0.05 or 5 out of a 100)."

You've taken the view that the frequentist position is right.

If you ask a Bayesian, they'd say that you can't make that statement without having an a priori estimate of the likelihood.

You can argue that the frequentist position is the most useful to adopt. You can't argue that it is straightforwardly true or actually bears up to even the tiniest degree of scrutiny.

Take the thought experiment:

1) If someone tells me that their study shows that drinking a big glass of orange juice raises blood sugar (P<0.05), I'm happy to agree that the probability that this is just noise is less than 5%. (Much less, actually, and I'll wonder why the study needed to be done.)

2) If someone tells me that their study shows that drinking a big glass of orange juice cures metastatic pancreatic cancer (P<0.05), it would be perverse to agree that the probability that the result is just noise is only 5%.

Sorry for only commenting on the thing that bugged me. I found this site based on someone posting a link to a nice cartoon about multiple comparisons. Your batting average seems to be pretty good so far.
Hilda Bastian, April 9, 2013 at 9:45 PM
Thanks for the comment - and glad you liked the Bonferroni one. Well, I'm not a Bayesian, it's true, but not entirely a frequentist either. I tried to make the explanation of the mathematical conception understandable, but I think I avoided saying it meant it must be true: I only spoke of probability, and linked back to another post explaining a certain proportion will always be wrong.

The frequentist heuristic is going to blow up errors, for sure. But so will the Bayesian one, if the priors are based on a paradigm that comes unraveled. Thanks for adding the orange juice example!
Anonymous, April 10, 2013 at 4:27 PM
Hello Hilda. "Statistical significance is reached when a 'p' value is less than 5%."
5% is merely convention at best or habit at worst. It is entirely arbitrary. While 5% seems to be used in medicine, in other fields a much lower value is used (e.g. in the hunt for fundamental particles like the Higgs Boson). I would encourage the use of such a cut-off only in the design of trials. In the methodology, report 5% (or whatever) as the alpha, but do not use the term "statistical significance" for a result with p<0.05. Rather, just report the p value.

John