Sunday, July 28, 2013

Alleged effects include howling



When dogs howl at night, it's not the full moon that sets them off. Dogs are communicating for all sorts of reasons. We're just not all that good at understanding what they're saying.

We make so many mistakes in attributing cause and effect, for so many reasons, that it's almost surprising we get it right as often as we do. But all the mistaken beliefs we catch ourselves holding don't seem to teach us a lesson. Pretty soon after catching ourselves out, we're at it again, taking mental shortcuts and being cognitive misers.

It's so pervasive, you would think we would know this about ourselves at least, even if we don't understand dogs. Yet we commonly underestimate how much bias is affecting our beliefs. That tendency has been dubbed the bias blind spot - and we tend to live in it.

Even taking all that into account, "effect" is an astonishingly overused word, especially in research and science communication, where you would hope people would be more careful. The maxim that correlation (things happening at the same time) does not necessarily mean causation has spread far and wide, becoming something of a cliché along the way.

But does that mean that people are as careful with the use of the word "effect" as they are with the use of the "cause" word? Unfortunately not.

Take this common one: "Side effects include...." Well, actually, don't be so fast to swallow that one. Sometimes, genuine adverse effects will follow that phrase. But more often, the catalogue that follows is not a list of adverse effects, but a list of adverse events - things that happened (or were reported) during treatment. Some of them may be causally related to the treatment; some might not be.

You have to look carefully at claims of benefits and harms. Even researchers who aren't particularly biased can word these things carelessly. You will often hear that 14% of people experienced nausea, say - without it being pointed out that 13% of people on placebo also experienced nausea, and that the difference wasn't statistically significant. Some adverse effects are well known, and it doesn't matter (diarrhea and antibiotics, say). That's not always so, though. (More on this: 5 Key Things to Know About Adverse Effects.)
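Here's a minimal sketch of that kind of comparison, with entirely hypothetical counts (not from any real trial): a simple two-proportion z-test on 14% versus 13% in two groups of 200 shows how little "signal" there can be in a number that sounds alarming on its own.

```python
# A rough sketch with made-up numbers: is 14% nausea on the drug really
# different from 13% nausea on placebo? A simple two-proportion z-test.
from math import sqrt

def two_proportion_z(events_a, n_a, events_b, n_b):
    """z statistic for the difference between two proportions."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Say 28 of 200 people on the drug reported nausea (14%),
# and 26 of 200 people on placebo reported it too (13%).
z = two_proportion_z(28, 200, 26, 200)
print(f"z = {z:.2f}")  # about 0.29 - far below 1.96, nowhere near statistical significance
```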

If the word "effect" is over-used, the word "hypothesis" is under-used. Although generating hypotheses is a critical part of science, hypotheses aren't really marketed as what they are: ideas in need of testing. Often the language is that of attribution throughout, with a little fig-leaf of a sentence tacked on about the need for confirmatory studies. In fact, we cannot take replication and confirmation for granted at all.





Sunday, June 30, 2013

Goldilocks and the three reviews



Goldilocks is right: that review is FAR too complicated. The methods section alone is 652 pages long! Which wouldn't be too bad, if it weren't already a few years out of date. It took so long to do this review and put it through a rigorous enough quality check that it was already out of date the day it was released. That happens often enough to be rather disheartening.

When methodology for systematic reviewing gets overly rococo, the point of diminishing returns will be passed. That's a worry, for a few reasons. For one, it's inefficient, and more reviews could be done with the resources. Secondly, more complex methodology can be daunting, and hard for researchers to apply consistently. Thirdly, when a review gets very elaborate, reproducing or updating it isn't going to be easy either.

It's unavoidable for some reviews to be massive and complex undertakings, though, if they're going to get to the bottom of massive and complex questions. Goldilocks is right about review number 2, as well: that one is WAY too simple. And that's a serious problem, too.

Reviewing evidence needs to be a well-conducted research exercise. A great way to find out more about what goes wrong when it's not, is reading Testing Treatments. And see more on this here at Statistically Funny, too.

You need to check the methods section of every review before you take its conclusions seriously - even when it claims to be "evidence-based" or systematic. People can take far too many shortcuts. Fortunately, it's not often that a review gets as bad as the second one Goldilocks encountered here. The authors of that review decided to include only one trial for each drug "in order to keep the tables and figures to a manageable size." Gulp!

Getting to a good answer also quite simply takes some time and thought. Making real sense of evidence and the complexities of health, illness and disability is often just not suited to a "fast food" approach. As the scientists behind the Slow Science Manifesto point out, science needs time for thinking and digesting.

To cover more ground, people are looking for reasonable ways to cut corners, though. There are many kinds of rapid review, including reliance on previous systematic reviews for new reviews. These can be, but aren't always, rigorous enough for us to be confident about their conclusions.

You can see this process at work in the set of reviews discussed at Statistically Funny a few cartoons ago. Review number 3 there is in part based on review number 2 - without re-analysis. And then review number 4 is based on review number 3.

So if one review gets it wrong, other work may be built on weak foundations. Li and Dickersin suggest this might be a clue to the perpetuation of incorrect techniques in meta-analyses: reviewers who got it wrong in their review were citing other reviews that had gotten it wrong, too. (That statistical technique, by the way, has its own cartoon.)

Luckily for Goldilocks, the bears had found a third review. It had sound methodology you can trust. It had been totally transparent from the start - included in PROSPERO, the international prospective register of systematic reviews. Goldilocks can get at the fully open review, and its data are in the Systematic Review Data Repository, open for others to check and re-use. Ahhh - just right!


PS:

I'm grateful to the Wikipedians who put together the article on Goldilocks and the three bears. That article pointed me to the fascinating discussion of "the rule of three" and the hold this number has on our imaginations.

Sunday, June 23, 2013

Studies of cave paintings have shown....



The mammoth has a good point. Ogg's father is making a classic error of logic. Not having found proof that something really happens is not the same as having definitive proof that this thing cannot possibly happen.

Ogg's family doesn't have the benefit of Aristotle's explanation of deductive reasoning. But more than two thousand years after Aristotle got started, we still often fall into this trap.

In evidence-based medicine, a part of this problem is touched on by the saying, "absence of evidence is not evidence of absence." A study says "there's no evidence" of a positive effect, and people jump to the conclusion - "it doesn't work." Baby Ogg gets thrown out with the bathwater.

The same thing is happening when there are no statistically significant serious adverse effects reported, and people infer from that, "it's safe." 

This situation is the opposite of the problem of reading too much into a finding of statistical significance (explained here). In this case, people are over-interpreting non-significance. Maybe the researchers simply didn't study enough people, or the right people, or they weren't looking at the outcomes that later turn out to be critical.
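A minimal simulation sketch of that problem, with made-up numbers: even when a treatment genuinely reduces the risk of a bad outcome from 30% to 20%, a small trial will usually fail to find a statistically significant difference.

```python
# A rough simulation with hypothetical rates: how often does a trial of a
# genuinely effective treatment reach statistical significance?
import random
from math import sqrt

def significant(events_a, n_a, events_b, n_b, z_crit=1.96):
    """Crude two-proportion z-test for 'statistical significance'."""
    pooled = (events_a + events_b) / (n_a + n_b)
    if pooled in (0, 1):
        return False
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return abs(events_a / n_a - events_b / n_b) / se > z_crit

def power(n_per_arm, rate_control=0.30, rate_treated=0.20, runs=5000):
    """Share of simulated trials that detect the (real) difference."""
    hits = 0
    for _ in range(runs):
        treated = sum(random.random() < rate_treated for _ in range(n_per_arm))
        control = sum(random.random() < rate_control for _ in range(n_per_arm))
        hits += significant(treated, n_per_arm, control, n_per_arm)
    return hits / runs

print(power(40))   # roughly 0.2: most small trials report "no evidence" of the real effect
print(power(400))  # roughly 0.9: a much bigger trial usually does detect it
```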

Researchers themselves can over-interpret negative results. Or they might phrase their conclusions carelessly. Even if they avoid the language pitfalls here, journalists could miss the nuance (or think the researchers are just being wishy-washy) and spread the wrong message. And even if everyone else phrased it carefully, the reader might jump to that conclusion anyway.

When researchers say "there is no evidence that...", they generally mean they didn't find any, or enough of, a particular type of evidence that they would find convincing. Obviously, no one can ever be sure they have even seen all the evidence. And it doesn't mean everyone would agree with their conclusion, either. To be reasonably sure of a negative, you might need quite a lot of evidence.

On the other hand, when quite a lot of knowledge already makes something extremely unlikely to be real - a community of giant blue swans with orange and pink polka dots on the Nile, say - that increases the confidence you might have in even a small study exploring that hypothesis.

In 2020, during the Covid-19 pandemic, we found out how deep another problem goes: taking the absence of particular types of evidence as the rationale for not taking public health action. Early in April I wrote in WIRED about how this was leading us to policies that didn't make sense - especially in not recommending personal masks to help reduce community transmission. At the same time, Trisha Greenhalgh and colleagues pointed out that this ignored the precautionary principle: it's important to avoid the harm caused by not taking other forms of evidence seriously enough. When it was finally acknowledged that the policy had to change, it was a recipe for chaos.

Which brings us to the other side of this coin. Proving that something doesn't exist, to the satisfaction of people who perhaps need to believe it most earnestly, can be quite impossible. People trying to disprove the claim that vaccination causes autism, for example, are finding that despite the Enlightenment, our rational side can be vulnerable to hijacking. Voltaire hit that nail on the head in the 18th century: "The interest I have to believe a thing is no proof that such a thing exists."


Voltaire quote from 1763 with a cartoon pic of a man from that period: "The interest I have to believe a thing is no proof that such a thing exists."




~~~~

Update 3 July 2020: Covid-19 paragraph added.


Tuesday, May 21, 2013

He said, she said, then they said...



Conflicting studies can make life tough. A good systematic review could sort it out. It might be possible for the studies to be pooled into a meta-analysis. That can show you the spread of individual study results and what they add up to, at the same time.

But what about when systematic reviews disagree? When the "he said, she said" of conflicting studies goes meta, it can be even more confusing. New layers of disagreement get piled onto the layers from the original research. Yikes! This post is going to be tough-going...

A group of us defined this discordance among reviews as: the review authors disagree about whether or not there is an effect, or the direction of effect differs between reviews. A difference in direction of effect can mean one review gives a "thumbs up" and another a "thumbs down."

Some people are surprised that this happens. But it's inevitable. Sometimes you need to read several systematic reviews to get your head around a body of evidence. Different groups of people approach even the same question in different but equally legitimate ways. And there are lots of different judgment calls people can make along the way. Those decisions can change the results the systematic review will get.

Differences in when and how they searched for studies - and in what types and subjects of studies they included - mean that it's not at all unusual for groups of reviewers to be looking at different sets of studies for much the same question.

After all that, different groups of people can interpret evidence differently. They often make different judgments about the quality of a study or part of one - and that could dramatically affect its value and meaning to them.

It's a little like watching a game of football where there are several teams on the field at once. Some of the players are on all the teams, but some are playing for only one or two. Each team has goal posts in slightly different places - and each team isn't necessarily playing by the same rules. And there's no umpire.

Here's an example of how you can end up with controversy and people taking different positions even when there's a systematic review. The area of some disagreement in this subset of reviews is about psychological intervention after trauma to prevent post-traumatic stress disorder (PTSD) or other problems:

Published in 2002; published in 2005; published in 2005; published in 2010; published in 2012; published in 2013.

The conclusions range from saying debriefing has a large benefit, to saying there is no evidence of benefit and that it seems to cause some PTSD. Most of the others, but not all, fall somewhere in between, leaning towards "we can't really be sure." Most are based only on randomized trials, but one includes no randomized trials at all, and one has a mixture of study types.

The authors are sometimes big independent national or international agencies. A couple of others include authors of the studies they are reviewing. The definition of trauma isn't the same - they may or may not include childbirth, for example. The interventions aren't the same.

The quality of evidence is very low. And the biggest discordance - whether or not there is evidence of harm - hinges mostly on how much weight you put on one trial.

That trial is about debriefing. The debriefing group is much bigger than the control group because the trial was stopped early - and while it's complicated, that can be a source of bias.

The people in the debriefing group were at quite a lot higher risk of PTSD in the first place. Data for more than 20% of the people randomized is missing - and that biases the results too (it's called attrition bias). You can't be sure those people didn't return because they were depressed, for example. If so, that could change the results.
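Here's a minimal sketch, with entirely made-up numbers (not the numbers from that trial), of why 20% missing data matters so much: a crude best-case/worst-case check shows how far a result could swing depending on what happened to the people who were lost.

```python
# A rough best-case/worst-case sensitivity check for missing outcome data
# (attrition), using hypothetical numbers.
def risk(events, n):
    return events / n

# Hypothetical trial: 50 people randomized per arm, outcomes known for only 40 per arm.
debrief_events, debrief_followed, debrief_randomized = 12, 40, 50
control_events, control_followed, control_randomized = 10, 40, 50
debrief_lost = debrief_randomized - debrief_followed   # 10 people missing
control_lost = control_randomized - control_followed   # 10 people missing

observed_rr = risk(debrief_events, debrief_followed) / risk(control_events, control_followed)

# Worst case for debriefing: all of its missing people developed PTSD and none of
# the control arm's did; the best case is the reverse.
worst_rr = risk(debrief_events + debrief_lost, debrief_randomized) / risk(control_events, control_randomized)
best_rr = risk(debrief_events, debrief_randomized) / risk(control_events + control_lost, control_randomized)

print(f"observed RR {observed_rr:.2f}; plausible range {best_rr:.2f} to {worst_rr:.2f}")
# With these numbers: observed 1.20, but anywhere from 0.60 (looks protective)
# to 2.20 (looks harmful) - the missing fifth of the data could decide the answer.
```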

It's no wonder there's still a controversy here.


See also my 5 tips for understanding data in meta-analysis.

Links to key papers about this in my comment at PubMed Commons (archived here).


If you want to read more about debriefing, here's my post in Scientific American: Dissecting the controversy about early psychological response to disasters and trauma.


Thursday, May 9, 2013

They just Google THAT?!


I admit I needed Google to quickly find out that the category for bunny-shaped clouds is "zoomorphic". And I think Google is wonderful - and so does Tess. But...

There's just been another study published about the latest generation of doctors and their information and searching habits. Like Tess' friend, they rely pretty heavily on Googling. We could all be over-estimating, though, just how good people are at finding things with Google - including the biomedically trained.

Many of us assume that the "Google generation" or "digital natives" are as good at finding information as they are at using technology. A review in 2008 came to the conclusion that this was "a dangerous myth" [PDF] and those things don't go hand in hand. It may not have gotten any better since then either.

Information literacy is about knowing when you need information, and knowing how to find and evaluate it. Google leads us to information that the crowd is basically endorsing. If the crowd has poor information literacy in health, then that can reinforce the problem.

This is an added complication for health consumers. While there's an increasing expectation that healthcare system decisions and clinical decisions be based on rigorous assessments of evidence, that's not really trickling down very fast. Patient information is generally still pretty old school.

What would it mean for patient information to be really evidence-based? I believe it includes using methods to minimize bias in finding and evaluating research to base the information on, and using evidence-based communication. Those ideas are gaining ground, for example in standards in England and Germany, and this evaluation by WHO Europe of one group of us putting these concepts into practice.

Missing critical information that can shift the picture is one of the most common ways that reviews of research can get it wrong. For systematic reviews of evidence, searching for information well is a critical and complex task.

This brings us to why Tess' talents, passions and chosen career are so important. We need health information specialists and librarians to link us with good information in many ways.

This week at the excellent annual meeting of the Medical Library Association in Boston (think lots of wonderful Tess'es!), there was a poster by Whitney Townsend and her colleagues at the Taubman Health Sciences Library (University of Michigan). Their assessment of 368 systematic reviews suggests that even systematic reviewers need help searching.

Google's great, but it doesn't mean we don't still need to "go to the library."


(Disclosure: I work in a library these days - the world's largest medical one, at the National Institutes of Health (NIH). If this has put you in the mood for honing your searching skills, there are some tips for searching PubMed Health here.)


Tuesday, April 23, 2013

Women and children overboard



It's the Catch-22 of clinical trials: to protect pregnant women and children from the risks of untested drugs....we don't test drugs adequately for them.

In the last few decades, we've been more concerned about the harms of research than about the harms of inadequately tested treatments - for everyone, in fact. But for "vulnerable populations," like pregnant women and children, the default was to exclude them.

And just in case any women might be, or might become, pregnant, it was often easier just to exclude us all from trials.

It got so bad that by the late 1990s, the FDA realized that regulations - and more - had to change for pregnant women, and for women generally. The NIH (National Institutes of Health) took action too. And so few drugs had enough safety and efficacy information for children that, even in official circles, children were being called "therapeutic orphans." Action began on that, too.

There is still a long way to go. But this month there was a sign that maybe times really are changing. The FDA approved Diclegis for nausea and vomiting in pregnancy. It's a new formulation of the key ingredients of Bendectin, the only other drug ever approved for that purpose in the USA. Nothing else has been shown to work.

Thirty years ago, the manufacturer withdrew Bendectin from the market because it was too expensive to keep defending it in the courts. It's a gripping story, involving the media, activists, junk science and some fraud. It had a major influence on clinical research, public opinion and more. You can read more about it in my guest blog at Scientific American, Catch-22, clinical trial edition: the double bind for women and children.

In dozens of court cases over Bendectin, judges and juries struggled with competing testimony about scientific evidence. In one hearing, a judge offered the unusual option of a "blue ribbon jury" or a "blue, blue ribbon jury": selecting only people who would be qualified to understand the complex testimony and issues of causation. The plaintiffs refused.

Ultimately, in one of the Bendectin cases, Daubert versus Merrell Dow Pharmaceuticals, the Supreme Court re-defined the rules around scientific evidence for US courts. The previous Frye rule called for general acceptance - consensus - in the relevant scientific community. The 1972 Federal Rules of Evidence said "all relevant evidence is admissible."

The new Daubert standard determined that evidence must be "reliable" - grounded in "the methods and procedures of science" - not just relevant.

We still need everyone involved to better understand what reliable scientific evidence on clinical effects really means, though. You can read more about that here at Statistically Funny.


Tuesday, April 9, 2013

Look, Ma - straight A's!



Unfortunately, little Suzy isn't the only one falling for the temptation to dismiss or explain away inconvenient performance data. Healthcare is riddled with this, as people pick and choose studies that are easy to find or that prove their points.

In fact, most reviews of healthcare evidence don't go through the painstaking processes needed to systematically minimize bias and show a fair picture.

A fully systematic review very specifically lays out a question and how it's going to be answered. Then the researchers stick to that study plan, no matter how welcome or unwelcome the results. They go to great lengths to find the studies that have looked at their question, and they analyze the quality and meaning of what they find.

The researchers might do a meta-analysis - a statistical technique to combine the results of studies (explained here at Statistically Funny). But you can have a systematic review without a meta-analysis - and you can do a meta-analysis of a group of studies without doing a systematic review.
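At its simplest, the arithmetic behind a meta-analysis is a weighted average: studies with more precise results count for more. Here's a minimal sketch of fixed-effect, inverse-variance pooling with made-up study results - real reviews use dedicated software and often more elaborate models.

```python
# A rough sketch of fixed-effect, inverse-variance meta-analysis with
# hypothetical study results (log odds ratios and their standard errors).
from math import exp

studies = [          # (log odds ratio, standard error) for each made-up trial
    (-0.30, 0.25),
    (-0.10, 0.15),
    (-0.45, 0.40),
]

weights = [1 / se**2 for _, se in studies]    # more precise studies get more weight
pooled = sum(w * est for (est, _), w in zip(studies, weights)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5

low, high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled OR = {exp(pooled):.2f} (95% CI {exp(low):.2f} to {exp(high):.2f})")
# With these numbers: about 0.83 (0.66 to 1.06) - a summary of what the studies add up to.
```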

To help make it easier for people to sift out the fully systematic from the less thorough reviews, a group of us, led by Elaine Beller, have just published guidelines for abstracts of systematic reviews. It's part of the PRISMA Statement initiative to improve reporting of systematic reviews. A quick way to find systematic reviews is the Epistemonikos database.

Do systematic reviews entirely solve the problem Julie saw with those school grades? Unfortunately, not always. Many trials aren't even published at all, and no amount of searching or digging can get to them. This happens even when the trial has good news, but it happens more often with disappointing results. The "fails" can be very well-hidden. Yes, it's as bad as it sounds: Ben Goldacre explains the problem and its consequences here.

You can help by signing up to the All Trials campaign - please do, and encourage everyone you know to do it too. Healthcare interventions simply won't all be able to have reliable report cards until the trials are not just done, but easy to get at.


Interest declaration: I'm the editor of PubMed Health and on the editorial advisory board of PLOS Medicine.


Sunday, April 7, 2013

Don't worry ... it's just a standard deviation


Of course, every time Cynthia and Gregory make the 8-block downtown trip to the Stinsons, it's going to take a different amount of time, depending on traffic and so on - even if it only varies by a minute or two.

Most of the time, the trip to the Stinsons' apartment would take between 10 minutes (in the middle of the night) and 45 minutes (in peak hour). Giving a range like that is similar to the concept of a margin of error or confidence interval (explained here).
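The 10-to-45-minute range describes the spread of individual trips; a confidence interval is a related but different idea, describing the uncertainty around an average. Here's a minimal sketch, with made-up trip times, of how a 95% confidence interval for the mean trip time could be calculated:

```python
# A rough sketch with hypothetical trip times (minutes): a 95% confidence
# interval for the average trip, using the t distribution.
from statistics import mean, stdev
from math import sqrt

trips = [18, 22, 35, 12, 27, 40, 15, 20, 31, 24]   # made-up sample of 10 trips
n = len(trips)
m, sd = mean(trips), stdev(trips)

t_crit = 2.262                     # t value for 95% confidence, n - 1 = 9 degrees of freedom
margin = t_crit * sd / sqrt(n)     # the margin of error
print(f"mean {m:.1f} min, 95% CI {m - margin:.1f} to {m + margin:.1f} min")
```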

So what's a standard deviation and what does it tell you? Well, it's not a comment on Gregory's behavior! Deviance as a term for abnormal behavior is an invention of the 1940s and '50s. Standard deviation (or SD) is a statistical term first used in 1894 by one of the key figures in modern statistics, Karl Pearson.

The standard deviation shows how far results typically are from the mean (or average). It will be bigger when the numbers are more spread out, and smaller when there's not a huge amount of difference between them.

Lots of results will cluster within 1 standard deviation of the mean, and most will be within 2 standard deviations. Roughly like this:





About 95% of results are going to be within 2 standard deviations in either direction from the mean. You can read about how 95% (or 0.05) came to have this significance here. Statistical significance is explained here at Statistically Funny.
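You can check that rule of thumb for yourself. Here's a minimal sketch that draws a big batch of normally distributed random numbers (an arbitrary mean and SD, just for illustration) and counts how many land within 1 and 2 standard deviations of the mean.

```python
# A rough empirical check of the "1 SD / 2 SD" rule of thumb on
# normally distributed random numbers (arbitrary mean and SD).
import random
from statistics import mean, stdev

values = [random.gauss(100, 15) for _ in range(50_000)]
m, sd = mean(values), stdev(values)

within_1sd = sum(abs(v - m) <= sd for v in values) / len(values)
within_2sd = sum(abs(v - m) <= 2 * sd for v in values) / len(values)
print(f"within 1 SD: {within_1sd:.0%}, within 2 SD: {within_2sd:.0%}")  # about 68% and 95%
```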

From the standard deviation, it's just a hop, skip, and jump to the standardized mean difference! More about that and an introduction to the mean generally here at Statistically Funny.
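For the curious, the standardized mean difference itself is simple arithmetic: the difference between two group means, divided by a pooled standard deviation. A minimal sketch with made-up pain scores shows why it's handy - two trials measuring the same thing on different scales end up with comparable numbers.

```python
# A rough sketch of the standardized mean difference (Cohen's d form),
# with hypothetical group summaries.
from math import sqrt

def smd(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2))
    return (mean_a - mean_b) / pooled_sd

# Two made-up trials measuring pain on different scales give similar SMDs:
print(round(smd(52, 18, 60, 45, 16, 60), 2))      # 0-100 scale: about 0.41
print(round(smd(5.1, 1.9, 80, 4.4, 1.7, 80), 2))  # 0-10 scale: about 0.39
```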




Monday, March 25, 2013

Every move you make....Are you watching you?


Monitoring.... There's something about getting things into numbers and targets that just makes them seem so controllable, isn't there? And many people - including many doctors - just love gadgets and measuring things. No wonder there is so much monitoring in health and fitness.

Actually, there's too much monitoring in some health matters. Some monitoring could cause anxiety without benefit, or lead to actions that do more harm than good.

Professor Paul Glasziou, author of Evidence-Based Monitoring, talked about this on Monday at Evidence Live. For monitoring to be effective there has to be:
  • valid and accurate measurement,
  • informed interpretation, and
  • effective action that can be taken on the results.  
Then there has to be an effective monitoring regimen.

None of that is simple. Frequent testing can mean you end up acting on random variations, not real changes in health. There's more at Statistically Funny about when statistical significance can mislead and the statistical risks of multiple testing.
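A minimal simulation sketch of that pitfall, with made-up numbers: if a perfectly stable measurement is checked weekly, and each check has a 5% chance of a false alarm, random variation alone will flag most people as "abnormal" at some point in the year.

```python
# A rough simulation of multiple testing in monitoring: weekly checks of a
# value that isn't actually changing, with normally distributed noise.
import random

def has_false_alarm(checks=52, threshold=1.96):
    """Does at least one of the checks cross the 'abnormal' threshold by chance?"""
    return any(abs(random.gauss(0, 1)) > threshold for _ in range(checks))

runs = 10_000
false_alarm_rate = sum(has_false_alarm() for _ in range(runs)) / runs
print(f"{false_alarm_rate:.0%} get at least one 'abnormal' result in a year of weekly checks")
# With a 5% false-positive rate per test and 52 tests, that's over 90% of people
# with nothing wrong at all.
```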

Self-monitoring can be a path to freedom and better health in some circumstances - if you use insulin or an anticoagulant like warfarin, for instance. But constant monitoring of everything you can measure is a whole other kettle of fish. You can read more about this, monitoring apps and 'the quantified self' in my guest blog at Scientific American: 'Every breath you take, Every move you make...' How much monitoring is too much?

Saturday, March 9, 2013

Nervously approaching significance



We're deluged with claims that we should do this, that or the other thing because some study has a "statistically significant" result. But don't let this particular use of the word "significant" trip you up: when it's paired with "statistically", it doesn't mean it's necessarily important. Nor is it a magic number that means that something has been proven to work (or not to work).

The p-value on its own really tells you very little. It is one way of trying to tell whether a result is more likely to be "signal" than "noise". If a study sample is very small, only a big difference might reach statistical significance, while in a bigger study, much smaller differences can.

But statistical significance is not a way to prove the "truth" of a claim or hypothesis. What's more, you don't even need the p-value, because other measures tell you everything the p-value can tell you, and more useful things besides.

This is roughly how the statistical test behind the p-value works. The test is based on the assumption that what the study is looking for is not true - that instead, the "null hypothesis" is true. The statistical test estimates how likely you would be to see the result you got, or one even further away from "null" than that result, if the null hypothesis were true.

If the p-value is <0.05 (less than 5%), then the result is compatible with what you would get if the hypothesis actually is true. But it doesn't prove it is true. You can't conclude too much based on that alone. The threshold of 0.05 for statistical significance means the level for the test has been set at 95%. That is common practice, but still a bit arbitrary.
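Here's a minimal simulation sketch of that logic, with entirely made-up numbers: assume the null hypothesis (no real difference between two groups) is true, run lots of imaginary trials, and count how often chance alone produces a difference at least as big as the one observed. That proportion is, roughly, the p-value.

```python
# A rough simulation of the p-value's logic: how often would chance alone,
# under the null hypothesis, produce a result at least as extreme as ours?
import random

observed_diff = 0.12        # hypothetical: 12 percentage points more responders on the drug
n_per_group = 50
null_rate = 0.40            # under the null, both groups share the same response rate

def one_null_trial():
    a = sum(random.random() < null_rate for _ in range(n_per_group)) / n_per_group
    b = sum(random.random() < null_rate for _ in range(n_per_group)) / n_per_group
    return abs(a - b)

runs = 20_000
p = sum(one_null_trial() >= observed_diff for _ in range(runs)) / runs
print(f"simulated p-value = {p:.2f}")
# With these made-up numbers, roughly 0.2 to 0.3 - well above 0.05, so "not significant",
# which still doesn't prove there is no difference.
```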

You can read more about statistical significance over here in my blog, Absolutely Maybe - and in Data Bingo! Oh no! and Does it work? here at Statistically Funny.

Always keep in mind that a statistically significant result is not necessarily significant in the sense of "important". It's "significant" only in the sense of signifying something. A sliver of a difference could reach statistical significance if a study is big enough. For example, if one group of people sleeps a tiny bit longer on average a night than another group of people, that could be statistically significant. But it wouldn't be enough for one group of people to feel more rested than the other.
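And here's the flip side, again with made-up numbers: a difference far too small to matter can still clear the 0.05 bar if the groups are huge. Two groups whose average sleep differs by about a minute a night come out "statistically significant" with 50,000 people per group.

```python
# A rough sketch: a trivially small difference becomes "statistically significant"
# in a very large sample (hypothetical sleep data).
from math import sqrt, erf

def z_and_p(mean_a, mean_b, sd, n_per_group):
    """Two-sample z-test assuming equal SDs; two-sided p from the normal distribution."""
    z = (mean_a - mean_b) / (sd * sqrt(2 / n_per_group))
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p

# Average sleep of 7.02 vs 7.00 hours (about a minute's difference), SD of 1 hour.
z, p = z_and_p(7.02, 7.00, 1.0, 50_000)
print(f"z = {z:.2f}, p = {p:.4f}")   # about p = 0.002: "significant", but nobody would feel it
```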

This is why people will often say something was statistically significant, but clinically unimportant, or not clinically significant. Clinical significance is a value judgment, often implying a difference that would change the decision that a clinician or patient would make. Others speak of a minimal clinically important difference (MCID or MID). That can mean they are talking about the minimum difference a patient could detect - but there is a lot of confusion around these terms.

Researchers and medical journals are more likely to trumpet "statistically significant" trial results to get attention from doctors and journalists, for example. Those medical journal articles are a key part of marketing pharmaceuticals, too. Selling copies of articles to drug companies is a major part of the business of many (but not all) medical journals. 

And while I'm on the subject of medical journals, I need to declare my own relationship with one I've long admired: PLOS Medicine - an international open access journal. As well as being proud to have published there, I'm delighted to have recently joined their Editorial Board.


(This post was revised following Bruce Scott's comment below.)