Monday, December 4, 2017

A Science Fortune Cookie

This fortune cookie could start a few scuffles. It's offering a cheerful scenario if you are looking for a benefit of a treatment, for example. But it sure would suck if you are measuring a harm! That's not what's contentious about it, though.

It's the p values and their size that can get things very heated. The p value is the result you get from a standard test for statistical significance. It can't tell you if a hypothesis is true or not, or rule out coincidence. What it can do is measure an actual result against a theoretical expectation, and let you know how surprising that result would be if there were really no effect at all (the "null hypothesis"). The smaller it is, the better: statistical significance is high when the p value is low. Statistical hypothesis testing is all a bit Alice-in-Wonderland!

As if it wasn't already complicated enough, people have been dividing rapidly into camps on p values lately. The p value has defenders - we shouldn't dump on the test, just because people misuse it, they say (here). Then there are those who think it should be abandoned or at least very heavily demoted (here and here, for example).

Then there is the camp in favor of raising the bar by lowering the level for p values. In September 2017, a bunch of heavy-hitters said the time has come to expect p values to be much tinier, at least when something new is claimed (here).

How tiny are they saying a p should be? The usual threshold has been p < 0.05 (less than 5%). Instead of that counting as a significant finding, they argue, a p value that is under 0.05 but not under 0.005 should only be called "suggestive" of a significant finding. A significant new finding should be way tinier: p < 0.005.

That camp reckons support for this change has reached critical mass. Which is suggestive of the <0.05 threshold going the way of the dodo. I have no idea what the fortune cookie on that says! (If you want to read more on avoiding p value potholes, check out my 5 tips on Absolutely Maybe.)

Now let's get back to the core message of our fortune cookie: the size of a p value is a completely separate issue from the size of the effect. That's because the size of a p value is heavily affected by the size of the study. You can have a highly statistically significant p value for a difference of no real consequence.
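Here's a rough sketch of that point in Python. The numbers are made up for illustration (a 1-point difference on a scale with a standard deviation of 10), and the test is a simple two-sided z-test that assumes both groups are the same size and share a known standard deviation. The function name `two_sided_p` is just a label for this example.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(mean_diff, sd, n_per_group):
    """Two-sided p value from a z-test comparing two group means.

    Assumption for this sketch: equal group sizes and a shared, known SD.
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical effect: the same 1-point difference every time.
# Only the size of the study changes.
for n in (50, 500, 5000, 50000):
    print(f"n per group = {n:>6}: p = {two_sided_p(1.0, 10.0, n):.4f}")
```

The difference between the groups never changes, but the p value does: with 50 people per group it is nowhere near 0.05, and with 5,000 per group it sails under even 0.005.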

There's another trap: an important effect might be real, but the study was too small to know for sure. Here's an example. It's a clinical trial of getting people to watch a video about clinical trials, before going through the standard informed consent process to join a hypothetical clinical trial. The control group went through the same consent process, but without the video.

The researchers looked for possible effects on a particular misconception, and on willingness to sign up for a trial. They concluded this (I added the bold):

An enhanced educational intervention augmenting traditional informed consent led to a meaningful reduction in therapeutic misconception without a statistically significant change in willingness to enroll in hypothetical clinical trials.

You need to look carefully when you see statements like this one. You might not be getting an accurate impression. Later, the researchers report:

[T]his study was powered to detect a difference in therapeutic misconception score but not willingness to participate.

That means they worked out how many people they needed to recruit based only on what was needed to detect a difference of several points in the average misconception scores. Willingness to join a trial dropped by a few percentage points, but the difference wasn't statistically significant. That could mean the video doesn't really reduce willingness - or it could mean the study was too small to answer the question. There's just a big question mark: this video reduced misconception, and a reduction in willingness to participate can't be ruled out.
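To get a feel for what "powered to detect" means, here is a minimal sketch of a power calculation for a difference between two proportions. The figures (60% versus 55% willing) are hypothetical and are not the trial's results, and the function `approx_power` uses a plain normal approximation rather than whatever method the researchers used.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for two proportions.

    Assumption for this sketch: equal group sizes, normal approximation.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    z_effect = abs(p1 - p2) / se
    # Chance the test statistic lands beyond the critical value
    # (ignores the tiny chance of "significance" in the wrong direction).
    return 1 - NormalDist().cdf(z_crit - z_effect)


# Hypothetical: willingness of 60% in one group versus 55% in the other.
for n in (100, 500, 2000):
    print(f"n per group = {n:>5}: power = {approx_power(0.60, 0.55, n):.2f}")
```

Under these assumptions, a study with 100 people per group would have roughly a one in ten chance of picking up that 5-point drop, while 2,000 per group gets close to nine in ten. A non-significant result from the small study says very little either way.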

What about the effect size? That is how big (or little) the difference between groups is. There are many different ways to measure it. For example, in this trial, "willingness to participate" was simply the proportion of people who said "yes" rather than "no".

However, the difference in "misconception" in that trial was measured by comparing mean results people scored on a test of their understanding. You can brush up on means, and how that leads you to standard deviations and standardized mean differences here at Statistically Funny.

There are other specific techniques used to set levels of what effect size matters - but those are for another day. In the meantime, there's a technical article explaining important clinical differences here. And another on Cohen's d, a measure that is often used in psychological studies. It comes with this rule of thumb: 0.2 is a small effect, 0.5 is medium, and 0.8 is a large effect.
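If you want to see how Cohen's d is put together, here is a small sketch. It uses the common pooled standard deviation version, the scores are invented for the example, and the helper names `cohens_d` and `rule_of_thumb` are my own labels, not from any particular library.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def rule_of_thumb(d):
    """Label an effect using the 0.2 / 0.5 / 0.8 rule of thumb."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "smaller than small"


# Invented test scores for two tiny groups.
d = cohens_d([2, 4, 6], [1, 3, 5])
print(d, rule_of_thumb(d))
```

The rule of thumb is only a rough guide: whether a "small" or "medium" effect actually matters still depends on what is being measured.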

Study reports should allow you to come to your own judgment about whether an effect matters or not. May the next research report you read be written by people who make that easy!

Number needed to confuse: read more at Statistically Funny on the objectivity - or not! - in ways of communicating about effects.