Statistically Funny

Motivated reasoning alert: YOU are at risk!

2025-06-06T03:27:00.002-04:00

There are so many ways bias goes under our radars and derails our judgment. Motivated reasoning and its tricky sidekick, confirmation bias, are some of the worst. Motivated reasoning allows our feelings and beliefs to to steer us to what feels like a rational and objective position. Confirmation bias lets us cherry pick evidence to suit: A study that confirms our beliefs gets a free pass, while we nitpick a challenging study to death – and then the findings that shore up our argument build up in our minds.

Digital and social media is an ideal force for super-charging all this: It's great at stirring up feelings and urgency so that people spread studies to a tribe of like-minded people without paying much – if any – critical attention to the content. Speed and emotion are natural enemies of objectivity.

All this keeps convincing us that the conclusion we want to have is the right one – and it's such a time saver! It's a big problem, though, if you want your beliefs to be an accurate view of reality.

Hilda Bastian

6 June 2025

A study in misplaced scientific flair

2023-06-24T01:36:00.002-04:00

It's the most-read part of scientific articles. Abstracts are supposed to give you a quick impression of research results. But while they may be small, they're crammed with traps. Hype in scientific articles seems to be escalating, and abstracts – those little study blurbs – tend to be the hypey-est part.

Of course, sometimes scientists have already gotten creative with a biased title for the article – bonus "points" for getting in first even earlier, with the study name or its acronym. (More on the acronymania menace here at Statistically Funny.)

What does this mean for us as readers? It means we can't be sure about the real takeaway messages from a study based on the abstract alone. Which, frankly, sucks.

There's some good news at least: There's some serious research into the problem. And there are some telltale signs. I reckon you have to keep a sharp eye out for adjectives and adverbs – that's where things often take a spicy turn.

In May 2023, Olivier Corneille and colleagues published a list of 22 persuasive communication devices to watch out for in academic papers. Gulp! There's a summary of the list in table 1.

On the plus side, today's hype words mightn't work for long, thanks to "semantic bleaching". That's when overuse of hyperbole "'bleaches' out the stronger meaning of the word." Though I guess there will always be new buzz words to take their place. Sigh!

My quick checklist for building a great scientific abstract.
Other Absolutely Maybe posts tagged "abstracts".

Disclosure: I'm a co-author of the 2013 PRISMA reporting guidelines for abstracts of systematic reviews. I have unfortunately contributed to the huge pile of conference abstracts that have never been followed by published papers, including a couple that evaluated the quality of abstracts (1998a, 1998b).

So, so many questions!

2022-12-14T01:22:00.002-05:00

I first encountered science seriously as a health consumer advocate, a very long time ago. And I thought of medical and health research as a search for answers. Scientists were problem-solvers, using rigorous testing to sort out the wheat of reliable answers from the chaff of the false leads.

But over time, as I watched the research pile up exponentially, the number of questions was zooming up even faster.

We find out treatment A works. Who – and what – else could it work for? Can you have half the dose and still get just as much benefit? Does double the dose do more good than harm? Will it work even better if you combine it with treatment B? Combine it with treatment C? Combine it with B and C? Is it better than old treatment Q? Will work it in gel form? In spray? . . .

Turns out scientific studies are, in fact, a great way to generate more questions. Answers, on the other hand, are often elusive.

So whose questions does science work with? For most of science’s history, the work has been dominated by one gender and one race, from just a few countries. The barriers to scholars from the Global South and access to, and recognition of, their work internationally remain shockingly high. On top of that, scientists typically worked at arm’s length from the people affected by the problems they were trying to solve. The result was often a very narrow set of questions.

Consider the experience of clinical researchers in rheumatology. It wasn’t until after the field embraced consumer participation that fatigue, sleep, and disease flares were seen as important outcomes to measure. And the process changed their research culture, too.

Diversity in science is critical, too, to bring a variety of perspectives to the question-asking table. That includes diversity of disciplines in scientific teams. Working across disciplines doesn’t just bring different points of views into the process. It can be essential for scientific quality. Poor scientific methods that have been rejected in some disciplines, persist in other parts of academia. Interdisciplinary science might be able to spread superior scientific methods into fields with weaker science.

If questions are so important to how our knowledge grows – and they are – then the issue of who gets to ask those questions is a fundamental concern. It can have a profound impact on what questions get addressed at all, and what is seen.

(This post is based on a 2019 post at Absolutely Maybe, drawing on the point on interdisciplinary science elaborated on in a 2022 post at Living With Evidence.)

Some studies are MONSTERS!

2022-11-21T20:14:00.001-05:00

On the plus side, this jerk explains a lot about the data in a meta-analysis!

This cartoon is a forest plot, a style of data visualization for meta-analysis results. Some people call them "blobbograms". Each of these horizontal lines with a square in the middle represents the results of a different study. The length of that horizontal line represents the length of the confidence interval (CI). That gives you an estimate of how much uncertainty there is around that result - the shorter it is, the more confident we can be about the result. (Statistically Funny explainer here.)

The square is called the point estimate - the study's "result" if you like. Often, it's sized according to how much weight the study has in the meta-analysis. The bigger it is, the more confident we can be about the result.

The size of the point estimate is echoing the length of the confidence interval. They are two perspectives on the same information. Small square and long line provides less confidence than a big square with a short line.

The diamond here is called the summary estimate. It represents the summary of the results from the 3 studies combined. It doesn't just add up the 3 results then divide them by 3. It's a weighted average. Bigger studies with more events count for more. (More on that later.)

The left and right tips of the diamond are the two ends of the confidence interval. With each study that gets added to the plot, those tips will get closer together, and it will move left or right if a study's result tips the scales in one direction.

The vertical line in the center is the "line of no effect". If a result touches or crosses it, then the result is not statistically significant. (That's a tricky concept: my explainer here.)

In biomedicine, forest plots are the norm. But in other fields, like psychology, the results of meta-analyses are often presented as tables of data. That means that each data point - the start and end of each confidence interval, and so on - are numbers in a column instead of plotted on a graph. (Here's a study that does that.)

So what about that jerk? He carries so much weight not just because the study has a lot of participants in it. What's called a study's precision depends on the number of "events" in the study, too.

Say the event you’re interested in is heart attacks – and you are investigating a method for reducing them. But for whatever reason, not a single person in the experimental or control group has a heart attack even though the study was big enough for you to have expected several. That study would have less ability to detect any difference your method could have made, so the study would have less weight.

It's very common for a study, or a couple of them, to carry most of the weight in a meta-analysis. A study by Paul Glasziou and colleagues found that the trial with the most precision carried an average of 51% of the whole result. When that's the case, you really want to understand that study.

Some studies are such whoppers that they overpower all other studies – no matter how many of them there are. They may never be challenged, just because of their sheer size: No one might ever do a study that large on the same question again.

The size of the point estimate and length of the line around it are clues to the weight of the study. The meta-analysis might also include the percentages of weight for each study.

Like to know more? This is a shorter version of one of the tips in my post at Absolutely Maybe, 5 Tips for Understanding Data in Meta-Analyses. Check it out for a more in-depth example of looking at the weight of a study and 4 more key tips!

Hilda

Researching our way to better research?

2022-10-31T16:42:00.001-04:00

Here we see an expert in evidence synthesis meet a metascientist!

Evidence synthesis is an umbrella term for the work of finding and making sense of a body of research – methods like systematic reviews and meta-analysis. And metascience is studying the methods of science itself. It includes studying the way science is published – see for example my posts on peer review research. And yes, there's metascience on evidence synthesis, too – and syntheses of metascience!

The terms metascience and metaresearch haven't been tossed around for all that long, compared to other types of science. Back when I took my first steps down this road in the early 1990s, in my neck of the science woods we called people who did this methodologists. A guiding light for us was the statistician and all-round fantastic human Doug Altman (1948-2018). He wrote a rousing editorial in 1994 called "The scandal of poor medical research," declaring "We need less research, better research, and research done for the right reasons." Still true, of course.

Altman and colleague, Iveta Simera, chart the early history of metascience over at the James Lind Library. Box 1 in that piece has a collection of scathing quotes about poor research methodology, starting in 1917 with this one on clinical evidence: "A little thought suffices to show that the greater part cannot be taken as serious evidence at all."

The first piece of research on research that they identified was published – with only the briefest of detail, unfortunately – by Halbert Dunn in 1929. He analyzed 200 quantitative papers, and concluded, "About half of the papers should never have been published as they stood." (It's on the second page here.)

The first detailed report came in 1966, by a statistician and medical student. They reckoned over 70% of the papers they examined should either have been rejected or revised before being published. A few years after that, the methods for evidence synthesis took an important step forward when Richard Light and Paul Smith published their "procedures for resolving contradictions among different research studies" (Light and Smith, 1971.)

Evidence synthesis and metascience have proliferated wildly since the 1990s. And there's lots of the better research that Altman hoped for, too. Unfortunately, though, it's still in the minority – even in evidence synthesis. Sigh! Will more research on research help? Someone should do research on that!

Hilda Bastian

Trial participants and the luck of the draw

2022-10-11T02:47:00.005-04:00

The guy in this cartoon really drew a short straw: most clinical trial participants, at least, know they were in a study. On the other hand, he was lucky that he was getting to hear from the researchers about the study's results! That used to be quite unlikely.

It might be getting better: a survey of trial authors from 2014-2015 found that half said they'd communicated results to participants. That survey had a low response rate – about 16% – so it might not be the best guide. There are quite a few studies these days on how to communicate results to participants, though, and that could be a good sign. (A systematic review of those studies is on the way, and I'll be keeping an eye out for it.)

Was our guy lucky to be in a clinical trial in the first place, or was he taking on a serious risk of harm?

An older review of trials (up to 2010) across a range of diseases and interventions found no major difference: trial participants weren't apparently more likely to benefit or be harmed. Another in women's health trials (up to 2015) concluded women who participated in clinical trials did better than those who didn't. And a recent one in pregnant women (up to May 2022) concluded there was no major difference. All of this, though, relies on data from a tiny proportion of all the trials that people participate in – and we don't even know the results of many of them.

I think a really thorough answer to this question would have to differentiate the types of trials. For perspective, consider clinical trials of drugs. Across the board, roughly 60% of drugs that get to phase 1 (very small early safety trials) or phase 2 (mid-stage small trials) don't make it the next phase. Most of the drugs that get to phase 3 (big efficacy trials) end up being approved: over 90% in 2015. The rate is higher than average for vaccines, and much lower for drugs for some diseases than others.

Not progressing to the next stage doesn't tell us if people in the trials benefited or were harmed on balance, but it shows why the answer to the question of impact on individual participants could be different for different types of trials.

So was the guy in the cartoon above lucky to be in a clinical trial? The answer is a very unsatisfactory, it depends on his specific trial! However, overall, there's no strong evidence of benefit or harm.

On the other hand, not doing trials at all would be a very risky proposition for the whole community. No matter which way you look at it, the rest of us have a lot of reasons to be very grateful to the people who participate in clinical trials: thank you all!

If you're interested in reading more about the history of people claiming either that participating in clinical trials is inherently risky or inherently beneficial, I dug into this in a post at Absolutely Maybe in 2020.

How are you?

2022-10-03T00:17:00.005-04:00

A simple question, theoretically, has a simple answer. That's not necessarily the case in a clinical trial, though. To measure in a way that can detect differences between groups, researchers often have to use methods that bear no relationship to how we think of a problem, or usually describe it.

Pain is a classic example. We use a lot of vivid words to try to explain our pain. But in a typical health study, that will be standardized. If that were done with what's called a "dichotomous" outcome – a straight up "yes or no" type question – it can be easy to understand the result.

But outcomes like pain can be measured on a scale, which is a "continuous" outcome: how bad is that pain, from nothing to the worst you can imagine? By the time average results between groups of people's scores get compared, it can be hard to translate back into something that makes sense. That's what the woman in the cartoon here is doing: comparing herself to people on a scale.

It pays to never put too much weight on the name of an outcome – check what it really means in the context of that study. There could be fine print that would make a difference to you – for example, “mortality” is measured, but only in the short-term, and you might have to dig hard to figure that our. Or the name the outcome is given might not be what it sounds like at all. People use the same names for outcomes they measure very differently.

Even something that sounds cut and dried can be...complicated. “Perinatal mortality” – death around childbirth – starts and ends at different times before and after birth, from country to country. “Stroke” might mean any kind, or some kinds. And then there's the complexity of composite outcomes – where multiple outcomes are combined and treated as if they're a single one. More on that here at Statistically Funny.

Some researchers put in the hard work of interpreting study measures to make sense in human terms. It would help the rest of us if more of them did that!

More posts on outcomes at Statistically Funny

And what's a standard deviation from the mean?

This post is based on a section of a post on deciphering outcomes in clinical trials at my Absolutely Maybe blog.

Hilda Bastian

In clinical trials, you can have it both ways

2021-03-31T19:53:00.002-04:00

"Were you in the vax group or the placebo?" It sounds like a simple question, that should have a simple answer, right? And usually it does. Unless it doesn’t. Welcome to the world of the cross-over trial!

The garden variety randomized trial is a parallel or concurrent trial: people get randomized to one of 2 or more groups, and they continue on their parallel tracks, at the same time. At the end of it, if all goes well, you have solid answers to the main question or questions you set out to resolve.

In a cross-over trial, on the other hand, people start off in one group, then along the way, each group of people swaps over with those in another group. Everyone gets the same options, they are just randomized to going through them in a different order – intervention A then B, or intervention B then A. That’s how the guy in the cartoon can be in both the vaccine and the placebo group.

Let's start with the advantage of doing trials like this. A crossover trial means each person is their own control. With that one move, you have removed a common reason for differences in outcomes – individuals' differences. And that means you need fewer people to get an answer.

Every health intervention question can't be answered this way of course – think surgery versus antibiotics for appendicitis, for example, or a drug that isn't going to leave your body to revert back to your usual state during the break between interventions ("wash-out period").

But what about our guy in a vaccine trial, though? They don't fit this picture, do they? Vaccines may wash out, but the benefits to your immune system of recognizing its enemy sure isn't supposed to!

Crossover trials for vaccines are in the spotlight because they're being used for Covid-19 vaccine trials. I discuss this in depth over at Absolutely Maybe – and for more technical discussion on this, see this preprint on the thinking behind the proposal, and Steve Goodman's slides for the US Food and Drug Administration's deliberations.

The crossover extensions of the Covid vaccine trials can't do everything a randomized controlled trial can do, but they can provide valuable data on some issues, especially if the people stay blinded. High amongst those is how long immunity lasts. That's because you now have one group that was vaccinated early, and one group who had deferred vaccination. After the crossover, if the infection rate between the groups stays the same, you know the early-vaccinated group's immunity isn't waning.

Back to the average crossover trial, though, which will be of treatments. What should look out for with those?

One problem is if the groups before the crossover are treated as though they are parallel trials. That's risky. Randomizing enough people to a parallel trial means you don't have to worry about differences between the individuals skewing the results – you don't have that when you're randomizing the order of interventions, not the people.

You also have to keep in mind what possible influence could the previous intervention have had. If the trial goes on for a while, then you have to consider whether the different time periods are now a factor – and more people might have dropped out before they had the second intervention, too.

And 2 final bonus points: "N of 1" trials are cross-overs. That's when you are trying out treatments in a formally structured way, though like all cross-over trials, it only works in some situations. (A quick look at those here at Statistically Funny.) And there's another kind of trial where people are controls for themselves: (Here's my quick look at those.)

Hilda Bastian

March 2021

To learn more about crossover trials, check out Stephen Senn's book, Cross-Over Trials in Clinical Research. This link will help you find it in a library near you.

Clinical Trials - More Blinding, Less Worry!

2018-08-12T19:30:00.001-04:00

She's right to be worried! There are so many possible cracks that bias can seep through, nudging clinical trial results off course. Some of the biggest come from people knowing which comparison group a participant will be, or has been, in. Allocation concealment and blinding are strategies to reduce this risk.

Before we get to that, let's look at the source of the problems we're aiming at here: people! They bring subjectivity to the mix, even if they are committed to the trial - and not everyone who plays a role will be supportive, anyway. On top of that, randomizing people - leaving their fate to pure chance - can be the rational and absolutely vital thing to do. But it's also "anathema to the human spirit", so it can be awfully hard to play totally by the rules.

And we're counting on a lot of people here, aren't we? There are the ones who enter an individual into one of the comparison groups in the trial. There are those individual participants themselves, and the ones dealing with them during the trial - healthcare practitioners who treat them, for example. And then there are the people measuring outcomes - like looking at an x-ray and deciding if it's showing improvement or not.

What could possibly go wrong?!

Plenty, it turns out. Trials that don't have good guard rails for concealing group allocation and then blinding it are likely to exaggerate the benefits of health treatments (meta-research on this here and here).

Let's start with allocation concealment. It's critical to successfully randomizing would-be trial participants. When it's done properly, the person adding a participant to a trial has no idea which comparison group that particular person will end up in. So they can't tip the scales out of whack by, say, skipping patients they think wouldn't do well on a treatment, when that treatment is the next slot to allocate.

Some allocation methods make it easy to succumb to the temptation to crack the system. When allocation is done using sealed envelopes, people have admitted to opening the envelopes till they get the one they want - and even going to the radiology department to use a special lamp to see through an opaque envelope, and breaking into a researcher's office to hunt for info! Others have kept logs to try to detect patterns and predict what the next allocation is going to be.

This happens more often than you might think. A study in 2017 compared sealed envelopes with a system where you have to ring the trial coordinating center to get the allocation. There were 28 clinicians - all surgeons - allocating their patients in this trial. The result:

With the sealed envelopes, the randomisation process was corrupted for patients recruited from three clinicians.

But there was an overall difference in the ages of people allocated in the whole "sealed envelope" period, too - so some of the others must have peeked now and then, too.

Messing with allocation was one of the problems that led to a famous trial of the Mediterranean diet being retracted recently. (I wrote about this at Absolutely Maybe and for the BMJ.) Here's what happened, via a report from Gina Kolata (New York Times):

A researcher at one of the 11 clinical centers in the trial worked in small villages. Participants there complained that some neighbors were receiving free olive oil, while they got only nuts or inexpensive gifts.

So the investigator decided to give everyone in the same village the same diet. He never told the leaders of the study what he had done.

"He did not think it was important"....

But it was: it was obvious on statistical analysis that the groups couldn't have been properly randomized.

The opportunities to mess up the objectivity of a trial by knowing the allocated group don't end with the randomization. Clinicians could treat people differently, thinking extra care and additional interventions are necessary for people in some groups, or being quicker to encourage people in one group to pull out of the trial. They might be more or less eager to diagnose problems, or judge an outcome measure differently.

Participants can do the equivalent of all this, too, when they know what group they are in - seek other additional treatments, be more alert to adverse effects, and so on. Ken Schulz lists potential ways clinicians and participants could change the course of a trial here, in Panel 1.

There's no way of completely preventing bias in a trial, of course. And you can't always blind people to participants' allocation when there's no good placebo, for example. But here are 3 relevant pillars of bias minimization to always look for when you want to judge the reliability of a trial's outcomes:

Adequate concealment of allocation at the front end;
Blinding of participants and others dealing with them during the trial; and
Blinding of outcome assessors - the people measuring or judging outcomes.

Pro tip: Go past the words people use (like "double blind") to see who was being blinded, and what they actually did to try to achieve it. You need to know "Who knew what and when?", not just what label the researchers put on it.

More on blinding here at Statistically Funny

6 Tips for Deciphering Outcomes in Health Studies at Absolutely Maybe.

Interested in learning more detail about these practices and their history? There's a great essay about the evolution of "allocation concealment" at the James Lind Library.

A Science Fortune Cookie

2017-12-04T07:59:00.002-05:00

This fortune cookie could start a few scuffles. It's offering a cheerful scenario if you are looking for a benefit of a treatment, for example. But it sure would suck if you are measuring a harm! That's not what's contentious about it, though.

It's the p values and their size that can get things very heated. The p value is the result you get from a standard test for statistical significance. It can't tell you if a hypothesis is true or not, or rule out coincidence. What it can do is measure an actual result against a theoretical expectation, and let you know if this is pretty much what you would expect to see if a hypothesis is true. The smaller it is, the better: statistical significance is high when the p value is low. Statistical hypothesis testing is all a bit Alice-in-Wonderland!

As if it wasn't already complicated enough, people have been dividing rapidly into camps on p values lately. The p value has defenders - we shouldn't dump on the test, just because people misuse it, they say (here). Then there are those who think it should be abandoned or at least very heavily demoted (here and here, for example).

Then there is the camp in favor of raising the bar by lowering the level for p values. In September 2017, a bunch of heavy-hitters say the time has come to expect p values to be much tinier, at least when something new is claimed (here).

How tiny are they saying a p should be? The usual threshold has been p <0.05 (less than 5%). Instead of that being a significant finding, they decided, just a bit less than 0.05 should only be called "suggestive" of a significant finding. A significant new finding should be way tinier: <0.005.

That camp reckons support for this change has reached critical mass. Which is suggestive of the <0.05 threshold going the way of the dodo. I have no idea what the fortune cookie on that says! (If you want to read more on avoiding p value potholes, check out my 5 tips on Absolutely Maybe.)

Now let's get back to the core message of our fortune cookie: the size of a p value is a completely separate issue from the size of the effect. That's because the size of a p value is heavily affected by the size of the study. You can have a highly statistically significant p value for a difference of no real consequence.

There's another trap: an important effect might be real, but the study was too small to know for sure. Here's an example. It's a clinical trial of getting people to watch a video about clinical trials, before going through the standard informed consent process to join a hypothetical clinical trial. The control group went through the same consent process, but without the video.

The researchers looked for possible effects on a particular misconception, and on willingness to sign up for a trial. They concluded this (I added the bold):

An enhanced educational intervention augmenting traditional informed consent led to a meaningful reduction in therapeutic misconception without a statistically significant change in willingness to enroll in hypothetical clinical trials.

You need to look carefully when you see statements like this one. You might not be getting an accurate impression. Later, the researchers report:

[T]his study was powered to detect a difference in therapeutic misconception score but not willingness to participate.

That means they worked out how many people they needed to recruit based only on what was needed to detect a difference of several points in the average misconception scores. Willingness to join a trial dropped by a few percentage points, but the difference wasn't statistically significant. That could mean it doesn't really reduce willingness - or it could mean the study was too small to answer the question. There's just a big question mark: this video reduced misconception, and a reduction in willingness to participate can't be ruled out.

What about the effect size? That is how big (or little) the difference between groups is. There are many different ways to measure it. For example, in this trial, "willingness to participate" was simply the proportion of people who said "yes" or "no".

However, the difference in "misconception" in that trial was measured by comparing mean results people scored on a test of their understanding. You can brush up on means, and how that leads you to standard deviations and standardized mean differences here at Statistically Funny.

There are other specific techniques used to set levels of what effect size matters - but those are for another day. In the meantime, there's a technical article explaining important clinical differences here. And another on Cohen's d, a measure that is often used in psychological studies. It comes with this rule of thumb: 0.2 is a small effect, 0.5 is medium, and 0.8 is a large effect.

Study reports should allow you to come to your own judgment about whether an effect matters or not. May the next research report you read be written by people who make that easy!

Number needed to confuse: read more at Statistically Funny on the objectivity - or not! - in ways of communicating about effects.

The Highs and Lows of the "Good Study"

2016-09-11T15:25:00.004-04:00

Imagine if weather reports only gave the expected average temperature across a whole country. You wouldn't want to be counting on that information when you were packing for a trip to Alaska or Hawaii, would you?

Yet that's what reports about the strength of scientific results typically do. They will give you some indication of how "good" the whole study is: and leave you with the misleading impression that the "goodness" applies to every result.

Of course, there are some quality criteria that apply to the whole of a study, and affect everything in it. Say I send out a survey to 100 people and only 20 people fill it in. That low response rate affects the study as a whole.

You can't just think about the quality of a study, though. You have to think about the quality of each result within that study. The likelihood is, the reliability of data will vary a lot.

For example, that imaginary survey could find that 25% of people said yes, they ate ice cream every week last month. That's going to be more reliable data than the answer to a question about how many times a week they ate ice cream 10 years ago. And it's likely to be less reliable than their answers to the question, "What year were you born?"

Then there's the question of missing data. Recently I wrote about bias in studies on the careers of women and men in science. A major data set people often analyze is a survey of people awarded PhDs in the United States. Around 90% of people answer it.

But within that, the rate of missing data for marital status can be around 10%, while questions on children can go unanswered 4 times as often. Conclusions based on what proportion of people with PhDs in physics are professors will be more reliable than conclusions on how many people with both PhDs in physics and school-age children are professors.

One of the most misleading areas of all for this are the abstracts and news reports of meta-analyses and systematic reviews. It will often sound really impressive: they'll tell you how many studies, and maybe how many people are in them, too. You could get the impression then, that this means all the results they tell you about have that weight behind them. The standard-setting group behind systematic review reporting says you shouldn't do that: you should make it clear with each result. (Disclosure: I was part of that group).

This is a really big deal. It's unusual for every single study to ask exactly the same questions, and gather exactly the same data, in exactly the same way. And of course that's what you need to be able to pool their answers into a single result. So the results of meta-analyses very often draw on a subset of the studies. It might be a big subset, but it might be tiny.

To show you the problem, I did a search this morning at the New York Times for "meta-analysis". I picked the first example of a journalist reporting on specific results of a meta-analysis of health studies. It's this one: about whether being overweight or obese affects your chances of surviving breast cancer. Here's what the journalist, Roni Caryn Rabin wrote - and it's very typical:

"Just two years ago, a meta-analysis crunched the numbers from more than 80 studies involving more than 200,000 women with breast cancer, and reported that women who were obese when diagnosed had a 41 percent greater risk of death, while women who were overweight but whose body mass index was under 30 had a 7 percent greater risk".

There really was not much of a chance that all the studies had data on that - even though you would be forgiven for thinking that when you looked at the abstract. And sure enough, this is how it works out when you dig in:

There were 82 studies and the authors ran 31 basic meta-analyses;
The meta-analytic result with the most studies in it included 24 out of the 82;
84% of those results combined 20 or fewer studies - and 58% had 10 or less. Sometimes only 1 or 2 studies had data on a question;
The 2 results the New York Times reported came from about 25% of the studies and less than 20% of the women with breast cancer.

The risk data given in the study's abstract and the New York Times report did not come from "more than 200,000 women with breast cancer". One came from over 42,000 women and the other from over 44,000. In this case, still a lot. Often, it doesn't work that out way, though.

So be very careful when you think, "this is a good study". That's a big trap. It's not just that all studies aren't equally reliable. The strength and quality of evidence almost always varies within a study.

Want to read more about this?

Here's an overview of the GRADE system for grading the strength of evidence about the effects of health care.

I've written more about why it's risky to judge a study by its abstract at Absolutely Maybe.

And here's my quick introduction to meta-analysis.

Cupid's Lesser-Known Arrow

2016-08-14T18:41:00.003-04:00

Cupid's famous arrow causes people to fall blindly in love with each other. That can end happily ever after. Not so with his lesser known "immortal time bias" arrow! That one causes researchers to fall blindly in love with profoundly flawed results - and that never ends well.

This type of time-dependent bias often afflicts observational studies. It's a particular curse for those studies relying on the "big data" from medical records. A recent study found close to 40% of susceptible studies in prominent medical journals were "biased upward by 10% or more". A study in 2011 found that 62% of studies of postoperative radiotherapy didn't safeguard against immortal time bias. That could make treatment look more effective than it really is.

So what is it? It's a stretch of time where an outcome couldn't possibly occur for one group - and that gives them a head start over another group. Samy Suissa describes a classic case from the early days of heart transplantation in the 1970s. A 1971 study showed 20 people who had heart transplants at Stanford lived an average of 200 days compared to 14 transplant candidates who didn't get them and survived an average of 34 days.

Those researchers had started the clock from the point at which all 34 people had been accepted into the program. Now of course, all the people who got the transplants were alive at the time of surgery. For the stretch of time they were on the waiting list, they were "immortal": you could not die and still get a heart transplant. So when people on the waiting list died early, they were in the no-transplant group.

When the data were re-analyzed by others in 1974 to factor this into account, the survival advantage of the operation disappeared. (More about the history in Hanley and Foster's article, Avoiding blunders involving 'immortal time'.)

This bias is also called survivor or survival bias, or survivor treatment selection bias. But time-dependent biases don't only affect death as an outcome. It can affect any outcome, not just death. So "immortal time" isn't really the best term. Hanley and Foster call it event-free time.

Carl von Walraven and colleagues are among the group that call this kind of phenomenon "competing risk bias":

Competing risks are events whose occurrence precludes the outcome of interest.

They are the authors of the 2016 study I mentioned above about how common the problem is. They show the impact on data in a study they did themselves on patient discharge summaries.

If you were re-admitted to hospital before you got to a physician visit with your discharge summary, you didn't fare as well as the people who went to the doctor. If you just compare the group who went to the physician for follow-up as the hospital encouraged with the group who didn't, the group who didn't visit their doctor had way higher re-admission rates. Not much surprise there, eh?

Von Walraven says the risk grew as people started to do more time-to-event studies. They put the problem down partly to the popularity of a method for survival ratios that doesn't recognize these risks in its basic analyses. That's Kaplan-Meier risk estimation. You see Kaplan-Meier curves referred to a lot in medical journals.

Although they're called curves, I think they look more like staircases. Here's an example: number of months survived here starts off the same, but gets better for the blue line after a year, plateauing a couple of years later.

Some common statistical programs don't have a way to deal with time-dependent calculations in Kaplan-Meier analyses, according to von Walraven. You need extensions of the programs to handle some data properly. The Royal Statistical Society points to this problem too, in the description for their 2-day course on Survival Analysis. (One's coming up in London in September 2016.)

Hanley and Foster have a great guide to recognizing immortal time bias (Table 1, page 956). The key, they say, is to "Think person-time, not person":

If authors used the term 'group', ask... When and how did persons enter a 'group'? Does being in or moving to a group have a time-related requirement?

Given the problem is so common, we have to be very careful when we read observational studies with time-to-event outcomes and survival analyses. If authors talk about cumulative risk analyses and accounting for time-dependent measures, that's reassuring.

But what we really need is for the people who do these studies - and all the information gatekeepers, from peer reviewers to journalists - to learn how to dodge this arrow.

More reading on a somewhat lighter note: my post at Absolutely Maybe on whether winning awards or elections affects longevity.

~~~~

The Kaplan-Meier "curve" image was chosen without consideration of its data or the article in which it appears. I used the National Library of Medicine's Open i images database, and erased explanatory details to focus only on the "curve". The source is an article by Kadera BE et al (2013) in PLOS One.

More Than Average Confusion About What Mean Means Mean

2015-11-29T10:51:00.005-05:00

She's right: on average, when people talk about "average" for a number, they mean the mean.

The mean is the number we're talking about when we "even out" a bunch of numbers into a single number: 2 + 3 + 4 equals 9. Divide that total by 3 - the number of numbers in that set - and you get the mean: 3.

But then you hear people make that joke about "almost half the people being below average" - and that's not the mean any more. That's a different average. It's the median - the number in the middle. It comes from the Latin word for "in the middle", just like the word medium. That's why we call the line that runs down the middle of a road the median strip, too.

If the numbers in a group are all pretty close to each other - like our example here, or, say, the ages of everyone in a class at school - then there's not much difference between the mean and median.

But if the numbers in a group are wildly far apart - the ages of the people who like Star Wars movies, for example, or whose favorite singer is Frank Sinatra - then it can make a very big difference. Even if Strangers In The Night had enough of a resurgence to drag the average age of Ol' Blue Eyes listeners down, the big Sinatra fan base would still skew older!

How far apart numbers in a dataset are spread from each other is called variance: if the numbers bunch up in the middle, the variance is small. And understanding or dealing with variance is where we start to head in the direction of, well, sort of means of means.

The distance of a piece of data from the group's mean is a great standard way to measure the spread. This is called the deviation from the mean. A measure called the standard deviation from the mean will be bigger when the numbers are more spread out. Lots of results will cluster within 1 standard deviation (SD), and most will be within 2 standard deviations. Roughly like this:

From here, it's a hop, skip to another calculation based on the mean that you often come across in health studies. It's a way to standardize the differences in means (average results) called the standardized mean difference (SMD).

The SMD needs to be used when outcomes have been measured in similar, but different, ways in groups that researchers are comparing.

There's a lot you can make sense of when you know what the means mean!

The SMD is calculated by dividing the differences in the mean in two groups by standard deviations. You can read more on standard deviations here at Statistically Funny.

Feel like testing your knowledge of the mean, median, and mode? (The mode is the number in a set that occurs the most often: so if our example had been 2 + 3 + 4 + 4, then the mode would have been 4.) Try the Khan Academy quiz.

Interested in the ancient roots of averages? Examples from Herodotus, Thucydides, and in Homer here (very academic).

Note: Edited to address broken links, on November 6, 2022.

Hilda Bastian

AGHAST! The Day the Trial Terminator Arrived

2015-09-30T07:42:00.001-04:00

Clinical trials are complicated enough when everything goes pretty much as expected. When it doesn't, the dilemma of continuing or stopping can be excruciatingly difficult. Some of the greatest dramas in clinical research are going on behind the scenes around this. Even who gets to call the shot can be bitterly disputed.

A trial starts with a plan for how many people have to be recruited to get an answer to the study's questions. This is calculated based on what's known about the chances of benefits and harms, and how to measure them.

Often a lot is known about all of this. Take a trial of antibiotics, for example. How many people will end up with gastrointestinal upsets is fairly predictable. But often the picture is so sketchy it's not much more than a stab in the dark.

Not being sure of the answers to the study's questions is an ethical prerequisite for doing clinical trials. That's called equipoise. The term was coined by lawyer Charles Fried in his 1974 book, Medical Experimentation. He argued that the investigator should be uncertain if people were going to be randomized. In 1987, Benjamin Freedman argued the case for clinical equipoise: that we need professional uncertainty, not necessarily individual uncertainty.

It's hard enough to agree if there's uncertainty at any time! But the ground can shift gradually, or even dramatically, while a trial is chugging along.

I think it's helpful to think of this in 2 ways: a shift in knowledge caused by the experience in the trial, and external reasons.

Internal issues that can put the continuation of the trial in question include:

Not being able to recruit enough people to participate (by far the most common reason);
More serious and/or frequent harm than expected tips the balance;
Benefits much greater than expected;
The trial turns out to be futile: the differences in outcome between groups is so small, even if the trial runs its course, we'll be none the wiser (PDF).

External developments that throw things up in the air or put the cat among the pigeons include:

A new study or other data about benefits or safety - especially if it's from another similar trial;
Pressure from groups who don't believe the trial is justified or ethical;
Commercial reasons - a manufacturer is pulling the plug on developing the product it's trialing, or just can't afford the trial's upkeep;
Opportunity costs for public research sponsors has been argued as a reason to pull the plug for possible futility, too.

Sometimes several of those things happen at once. Stories about several examples are in a companion post to this one over at Absolutely Maybe. They show just how difficult these decisions are - and the mess that stopping a trial can leave behind.

Trials that involve the risk of harm to participants should have a plan for monitoring the progress of the trial without jeopardizing the trial's integrity. Blinding or masking the people assessing outcomes and running the trial is a key part of trial methodology (more about that here). Messing with that, or dipping into the data often, could end up leading everyone astray. Establishing stopping rules before the trial begins is the safeguard used against that - along with a committee of people other than the trial's investigators monitoring interim results.

Although they're called stopping "rules", they're actually more guideline than rule. And other than having it done independently of the investigators, there is no one widely agreed way to do it - including the role of the sponsors and their access to interim data.

Some methods focus on choosing a one-size-fits-all threshold for the data in the study, while others are more Bayesian - taking external data into account. There is a detailed look at this in a 2005 systematic review of trial data monitoring processes by Adrian Grant and colleagues for the UK's National Institute of Health Research (NIHR). They concluded there is no strong evidence that the data should stay blinded for the data monitoring committee.

A 2006 analysis HIV/AIDS trials stopped early because of harm, found that only 1 out of 10 had established a rule for this before the trial began but it's more common these days. A 2010 review of trials stopped early because the benefits were greater than expected found that 70% mentioned a data monitoring committee (DMC). (These can also be called data and safety monitoring boards (DSMBs) or data monitoring and ethics committees (DMECs).)

Despite my cartoon of data monitoring police, DMCs are only advisors to the people running the trial. They're not responsible for the interpretation of a trial's results, and what they do generally remains confidential. Who other than the DMC gets to see interim data, and when, is a debate that can get very heated.

Clinical trials only started to become common in the 1970s. Richard Stephens writes that it was only in the 1980s, though, that keeping trial results confidential while the trial is underway became the expected practice. In some circumstances, Stephens and his colleagues argue, publicly releasing interim results while the trial is still going on can be a good idea. They talk about examples where the release of interim results saved trials that would have foundered because of lack of recruitment from clinicians who didn't believe the trial was necessary.

One approach when there's not enough knowledge to make reliable trial design decisions is a type of trial called an adaptive trial. It's designed to run in steps, based on what's learned. About 1 in 4 might adapt the trial in some way (PDF). It's relatively early days for those.

In the end, no matter which processes are used, weighing up the interests of the people in the trial, with the interests of everyone else in the future who could benefit from more data, will be irreducibly tough. Steven Goodman writes that we need more people with enough understanding and experience of the statistics and dilemmas involved in data monitoring committees.

We also need to know more about when and how to bring people participating in the trial into the loop - including having community representation on DMCs. Informing participants at key points more would means some will leave. But most might stay, as they did in the Women's Health Initiative hormone therapy trials (PDF) and one of the AZT trials in the earlier years of the HIV epidemic.

There is one clearcut issue here. And that's the need to release the results of any trial when it's over, regardless of how or why it ended. That's a clear ethical obligation to the people who participated in the trial - the desire to advance knowledge and help others is one of the reasons many people agree to participate. (More on this at the All Trials campaign.)

More at Absolutely Maybe: The Mess That Trials Stopped Early Can Leave Behind

~~~~

Trial acronyms: If someone really did try to make an artificial gallbladder - not to mention actually start a trial on it! - I think lots of us would be pretty aghast! But a lot of us are pretty aghast about the mania for trial acronyms too. More on that here at Statistically Funny.

ARR OR NNT? What's Your Number Needed To Confuse?

2015-07-19T16:28:00.003-04:00

I used to think numbers are completely objective. Words, on the other hand, can clearly stretch out, or squeeze, people's perceptions of size. "OMG that spider is HUGE!" "Where? What - that little thing?"

Yes, numbers can be more objective than words. Take adverse effects of health care: if you use the word "common" or "rare", people won't get as accurate an impression as if you use numbers.

But that doesn't mean numbers are completely objective. Or even that numbers are always better than words. Numbers get a bit elastic in our minds, too.

We're mostly good at sizing up the kinds of quantities that we encounter in real life. For example, it's pretty easy to imagine a group of 20 people going to the movies. We can conceive pretty clearly what it means if 18 say they were on the edge of the seats the whole time.

There's an evolutionary theory about this, called ecological rationality. The idea is, our ability to reason with quantities developed in response to the quantities around us that we frequently need to mentally process. (More on this in Brase [PDF] and Gigerenzer and Hoffman [PDF].)

Whatever the reason, we're just not as good at calibrating risks that are lower frequency (Yamagishi [PDF]). We're going to get our heads around 18 out 20 well. But 18000 out of 200000? Not so much. We'll do pretty well at 1 out of 10, or 1 out of 100 though.

And big time trouble starts if we're reading something where the denominators are jumping around - either toggling from percent to per thousand and back, or saying "7 out of 13 thought the movie was great, while 4 out of 19 thought it was too scary, and 9 out of 17 wished they had gone to another movie". We'll come back to this in a minute. But first, let's talk about some key statistics used to communicate the effects of health care.

Statistics - where words and numbers combine to create a fresh sort of hell!

First there's the problem of the elasticity in the way our minds process the statistics. That means that whether they realize it or not, communicators' choice of statistic can be manipulative. Then there's the confusion created when people communicate statistics with words that get the statistics wrong.

Let's look at some common measures of effect sizes: absolute risk (AR), relative risk (RR), odds ratio (OR), and number needed to treat (NNT). (The evidence I draw on is summarized in my long post here.)

Natural frequencies are the easiest thing for people generally to understand. And getting more practice with natural frequencies might help us to get better at reasoning with numbers, too (Gigerenzer again [PDF]).

Take our movie-goers again. Say that 6 of the 20 were hyped-up before the movie even started. And 18 were hyped-up afterwards. Those are natural frequencies. If I give you those "before and after" numbers in percentages, that's "absolute risk" (AR). Lots of people (but not everybody) can manage the standardization of percentages well.

But if I use relative risks (RR) - people were 3 times as likely to be hyped-up after seeing that movie - then the all-important context of proportion is lost. That's going to sound like a lot, whether it's a tiny difference or a huge difference. People will often react to that without stopping to check, "yes, but from what to what?" From 6 to 18 out of 20 is a big difference. But going from 1 out of a gazillion to 3 out of a gazillion just ain't much worth crowing or worrying about.

RRs are critically important: they're needed for calculating a personalized risk if you're not at the same risk as the people in a study, for example. But if it's the only number you look at, you can get an exaggerated idea.

So sticking with absolute risks or natural frequencies, and making sure the baseline is clear (the "before" number), is better at helping people understand an effect. Then they can put their own values on it.

The number needed to treat, takes the change in absolute change and turns it upside down. (Instead of calculating the difference out of 100, it's 100 divided by the difference.) So that instead of the constant denominator of 100, you now have ones that change: instead of 60% of people being hyped-up because of the movie, it becomes NNT 1.7 (1.7 people have to see the movie for 1 person to get hyped-up).

This can be great in some circumstances, and many people are really used to NNTs. But on average, this is one of the hardest effect measures to understand. Which means that it's easier to be manipulated by it.

NNT is the anti-RR if you like: RRs exaggerate, NNTs minimize. Both can mislead - and that can be unintentional or deliberate.

When it comes to communicating with people who need to use results, I think using only statistics that will frequently mislead because it's a preference of the communicator is paternalistic, because it denies people the right to an impression based on their own values. Like all forms of paternalism, that's sometimes justified. But there's a problem when it becomes the norm.

The NNT was developed in the 1990s [PDF]. It was meant to do a few things - including counteracting the exaggeration of the RR. Turns out it overshot the mark there! It was also intended to be easier to understand than the odds ratio (OR).

The OR brings us to the crux of the language problems. People use words like odds, risks, and chances interchangeably. Aaarrrggghhh!

A risk in statistics is what we think of as our chances of being in the group: a 60% absolute risk means a 60 in 100 (or 6 in 10) "chance".

An odds ratio in statistics is like odds in horse-racing and other gambling. It factors in both the odds of "winning" versus the odds of "losing". (If you want to really get your head around this, check out Know Your Chances by Woloshin, Schwartz, and Welch. It's a book that's been shown in trials to work!)

The odds ratio is a complicated thing to understand, especially if it's embedded in confusing language. It's a very sound way to deal with data from some types of studies, though. So you see odds ratios a lot in meta-analyses. (If you're stumped about getting a sense of proportion in a meta-analysis, look at the number of events and the number of participants - they are the natural frequencies.)

There's one problem that all of these ways of portraying risks/chances have in common: when people start putting them in sentences, they frequently get the language wrong. So they can end up communicating something entirely other than what was intended. You really need to double-check exactly what the number is, if you want to protect yourself from getting the wrong impression.

OK, then, so what about "pictures" to portray numbers? Can that get us past the problems of words and numbers? Graphs, smile-y versus frown-y faces, and the like? Many think this is "the" answer. But...

This is going to be useful in some circumstances, misleading in others. Gerd Gigerenzer and Adrian Edwards: "Pictorial representations of risk are not immune to manipulation either". (A topic for another time, although I deal with it a little in the "5 shortcuts" post listed below.)

Where does all this leave us? Few researchers reporting data have the time to invest in keeping up with the literature on communicating numbers - so while we can plug away at improving the quality of reporting of statistics, there's no overnight solution there.

Getting the hang of the common statistics yourself is one way. But the two most useful all-purpose strategies could involve detecting bias.

One is to sharpen your skills at detecting people's ideological biases and use of spin. Be on full alert when you can see someone is utterly convinced and trying to persuade you with all their chips on a particular way of looking at data - especially if it's data on a single outcome. If the question matters to you, beware of the too-simple answer.

The second? Be on full alert when you see something you really want, or don't want, to believe. The biggest bias we have to deal with is our own.

More of my posts relevant to this theme:

Does It Work? Beware of the Too-Simple Answer

At Absolutely Maybe (PLOS Blogs):
5 Shortcuts to Keep Data on Risks in Perspective
Mind your "p"s, RRs, and NNTs: On Good Statistics Behavior

At Third Opinion (MedPage Today):
The Trouble With Evidence-Based Medicine, the 'Brand'
The NNT: An Overhyped and Confusing Statistic

Check out hildabastian.net for a running summary of what I'm writing about.

Let's Play Outcome Mash-up - A Clinical Trial Shortcut Classic!

2015-02-08T12:17:00.002-05:00

Deciphering trial outcomes can be a tricky business. As if many measures aren't hard enough to make sense of on their own, they are often combined in a complex maneuver called a composite endpoint (CEP) or composite outcome. The composite is treated as a single outcome. And journalists often phrase these outcomes in ways that give the impression that each of the separate components has improved.

Here's an example from the New York Times, reporting on the results of a major trial from the last American Heart Association conference:

"There were 6.4% fewer cardiac events - heart disease deaths, heart attacks, strokes, bypass surgeries, stent insertions and hospitalization for severe chest pain..."

That individual statement sounds like the drug reduced deaths, bypasses, stents, and hospitalization for unstable angina, doesn't it? But it didn't. The modest effect was on non-fatal heart attacks and stroke only.*

CEPs are increasingly common: by 2007, well over a third of cardiovascular trials were using them. CEPs are a clinical trial shortcut because you need fewer people and less time to hit a jackpot. A trial's main pile of chips is riding on its pre-specified primary outcome: the one that answers the trial's central, most important question.

The primary outcome determines the size and length of the trial, too. For example, if the most important outcome for a chronic disease treatment is to increase the length of people's lives, you would need a lot of people to get enough events to count (the event in this case would be death). And it would take years to get enough of those events to see if there's anything other than a dramatic, sudden difference.

But if you combine it with one or more other outcomes - like non-fatal heart attacks and strokes - you'll get enough events much more quickly. Put in lots, and you're really hedging your bets.

It's a very valuable statistical technique - but it can go haywire. Say you have 3 very serious outcomes that happen about as often as each other - but then you add another component that is less serious and much more common. The number of less serious events can swamp the others. Everything could even be riding on only one less serious component. But the CEP has a very impressive name - like "serious cardiac events." Appearances can be deceptive.

Enough data on the nature of the events in a CEP should be clearly reported so that this is obvious, but it often isn't. And even if the component events are reported deep in the study's detail, don't be surprised if it's not pointed out in the abstract, press release, and publicity!

There are several different ways a composite can be constructed, including use of techniques like weighting that need to be transparent. Because it's combining events, there has to be a way of dealing with what happens when more than one event happens to one person - and that's not always done the same way. The definitions might make it obvious, the most serious event might count first according to a hierarchy, or the one that happened to a person first might be counted. But exactly what's happening often won't be clear - maybe even most of the time.

There's agreement on some things you should look out for (see for example Montori, Hilden, and Rauch). Are each of the components as serious as each other and/or likely to increase (or decrease) together in much the same way? If one's getting worse and one's getting better, this isn't really measuring one impact.

The biggest worry, though, is when researchers play the slot machine in my cartoon (what we call the pokies, "Downunder"). I've stressed the dangers of hunting over and over for a statistical association (here and here). The analysis by Lim and colleagues found some suggestion that component outcomes are sometimes selected to rig the outcome. If it wasn't the pre-specified primary outcome, and it wasn't specified in the original entry for it in a trials register, that's a worry. Then it wasn't really a tested hypothesis - it's a new hypothesis.

Composite endpoints, properly constructed, reported, and interpreted are essential to getting us decent answers to many questions about treatments. Combining death with serious non-fatal events makes it clear when there's a drop in an outcome largely because people died before that could happen, for example. But you have to be very careful once so much is compacted into one little data blob.

(Check out slide 14 to see the forest plot of results for the individual components the journalist was reporting on. Forest plots are explained here at Statistically Funny.)

More on understanding clinical trial outcomes:

Another way to get clinical trial results quickly: surrogates and biomarkers (at Statistically Funny)
Keeping risks in perspective (at Absolutely Maybe)

New this week: I'm delighted to now have a third blog, one for physicians with the wonderful team at MedPage Today. It's called Third Opinion.

Biomarkers Unlimited: Accept Only OUR Substitutes!

2014-11-30T23:24:00.003-05:00

Sounds great, doesn't it? Getting clinical trial results quickly has so much going for it. Information sooner! More affordable trials!

Substituting outcomes that can take years, or even decades, to emerge, with ones you can measure much earlier, makes clinical research much simpler. This kind of substitute outcome is called a surrogate (or intermediate) endpoint or outcome.

Surrogates are often biomarkers - biological signs of disease or a risk factor of disease, like cholesterol in the blood. They are used in clinical care to test for, or keep track of, signs of emerging or progressing disease. Sometimes, like cholesterol, they're the target of treatment.

The problem is, these kinds of substitute measures aren't always reliable. And sometimes we find that out in the hardest possible way.

The risk was recognized as soon as the current methodology of clinical trials was being developed in the 1950s. Austin Bradford Hill, who played a leading role, put it bluntly: if the "rate falls, the pulse is steady, and the blood pressure impeccable, we are still not much better off if unfortunately the patient dies."

That famously happened with some drugs that controlled cardiac arrhythmia - irregular heartbeat that increases the chances of having a heart attack. On the basis of ECG tests that showed the heartbeat was regular, these drugs were prescribed for years before a trial showed that they were causing tens of thousands of premature deaths, not preventing them. That kind of problem has happened too often for comfort.

It happened again this week - although at least before the drug was ever approved. A drug company canceled all its trials for advanced gastric (stomach) cancer of a new drug. The drug is called Rilotumumab. Back in January, it was a "promising" treatment, billed as bringing "new hope in gastric cancer." It got through the early testing phases and was in Phase III trials - the kind needed to get FDA approval.

But one phase III trial, RILOMET-1, quickly showed an increase in the number of deaths in people using the drug. We don't know how many yet - but it was enough for the company to decide to end all trials of the substance.

This drug targets a biomarker associated with worse disease outcomes, an area seen by some as transforming gastric cancer research and treatment. Others see considerable challenges, though - and what happened to the participants in the RILOMET-1 trial underscores why.

There is a lot of controversy about surrogate outcomes - and debates about what's needed to show that an outcome or measure is a valid surrogate we can rely on. They can lead us to think that a treatment is more effective than it really is.

Yet a recent investigative report found that cancer drugs are being increasingly approved based only on surrogate outcomes, like "progression-free survival." That measures biomarker activity rather than overall survival (when people died).

It can be hard to recognize at first, what's a surrogate and what's an actual health outcome. One rule of thumb is, if you need a laboratory test of some kind, it's more likely to be a surrogate. Whereas symptoms of the disease you're concerned, or harm caused by the disease, are the direct outcomes of interest. Sometimes those are specified as"patient-relevant outcomes."

Many surrogate outcomes are incredibly important, of course - viral load for HIV treatment and trials for example. But in general, when clinical research results are based only on surrogates, the evidence just isn't as strong and reliable as it is for the outcomes we are really concerned about.

~~~~

See also, Statistically Funny on "promising" treatments.

Sheesh - what are those humans thinking?

2014-10-12T02:01:00.000-04:00

I can neither confirm nor deny that Cecil is now a participant in one of the there-is-no-limit-to-the-human-lifespan resveratrol studies at Harvard's "strictly guarded mouse lab"! If he is, I'm sure he's even more baffled by the humans' hype over there.

Resveratrol is the antioxidant in grapes that many believe makes drinking red wine healthy. And it's a good example of how research on animals is often terribly misleading and misinterpreted. I've written about it over at Absolutely Maybe if you're interested in a classic example of the rise and fall of animal-research-based hype (or more detail about resveratrol).

But this week, it's media hype about a study using human stem cells in mice in another lab at Harvard that's made me ratty. You could get the idea that a human trial of a "cure" for type 1 diabetes is just a matter of time now - and not a lot of time at that. According to the leader of the team, Doug Melton, "We are now just one preclinical step away from the finish line."

An effective treatment that ends the need for insulin injections would be incredibly exciting. But we see this kind of claim from laboratory research all the time, don't we? How often does it work out - even for the studies that are at "the finishing line" for animal studies?

Not all that often: maybe about a third of the time.

Bart van der Worp and colleagues wrote an excellent paper explaining why. It's not just that other animals are so different from humans. We're far less likely to hear of the failed animal results than we are of human trials that don't work out as hoped. That bias towards positive published results draws an over-optimistic picture.

As well as fundamental differences between species, van der Worp points to other common issues that reduce the applicability for humans of typical studies in other animals:

The animals tend to be younger and healthier than the humans who have the health problem;
They tend to be a small group of animals that are very similar to each other, while the humans with the problem are a large very varied group;
Only male or only female animals are often used; and
Doses higher than humans will be able to tolerate are generally used.

Limited genetic diversity could be an issue, too.

So how does the Harvard study fare on that score? They used stem cells to develop insulin-producing cells that appeared to function normally when transplanted into mice. But this was the very early stages. When it came to the test they reported on the ones with diabetes, there were only 6 (young) mice who got the transplants (and 1 died) (plus a comparison group). Gender was not reported - and as is common in laboratory animal studies, there wasn't lengthy follow-up. This was an important milestone, but there's a very long way to go here. Transplants in humans face a lot of obstacles.

Van der Worp points to another set of problems: inadequacies in research methods that we've learned over time in human research bias the proceedings too much - including problems with statistical analyses. Jennifer Hirst and colleagues have studied this too. They concluded that so many studies were bedeviled by issues such as lack of randomization and blinding by those assessing outcomes, that they should never have been regarded as being "the finishing line" before human experimentation at all.

There's good news though! CAMARADES is working to improve this - with the same approach for chipping away at these problems as in human trials: by slogging away at biased methodologies and publication bias. And pushing for good quality systematic reviews of animal studies before human trials are undertaken. It's well worth half an hour to watch the wonderful talk by Emily Sena at Evidence Live 2015.

Laboratory animal research may be called "preclinical," but even that jargon is a bit of over-optimistic marketing. Most of what's tried in the lab will never get near human trials. And when it does, it will mostly be disappointing. Laboratory research is needed, and encouraging progress is great. But people should definitely not be getting our hopes up too much about it.

~~~~

The National Institutes of Health (NIH) addressed the issue of gender in animal experiments earlier in 2014. After I wrote this post, the NIH also released proposed guidelines for reporting preclinical research.

Thanks to Jonathan Eisen for adding a link for the full text of the paper to PubMed Commons, as well as to a blog post by Paul Knoepfler discussing the context of the stem cell work by Felicia Pagliuca, Doug Melton and colleagues. NHS Behind the Headlines have also analyzed and explained this study.

Thanks to Jim Johnson for pointing an oversight: that animal studies - this one included - can also suffer from having too little follow-up.

Interest declaration: I'm an academic editor at one of the journals whose papers on animal research I commended (PLOS Medicine) and on the human ethics advisory group of another (PLOS One), but I had no involvement in either paper.

Update: Checked, post and cartoon refreshed, and link to Sena's talk at Evidence Live on 5 December 2015.

If at first you don't succeed...

2014-03-16T14:32:00.002-04:00

If only post hoc analyses always brought out the inner skeptic in us all! Or came with red flashing lights instead of just a little token "caution" sentence buried somewhere.

Post hoc analysis is when researchers go looking for patterns in data. (Post hoc is Latin for "after this.") Testing for statistically significant associations is not by itself a way to sort out the true from the false. (More about that here.) Still, many treat it as though it is - especially when they haven't been able to find a "significant" association, and turn to the bathwater to look for unexpected babies.

Even when researchers know the scientific rules and limitations, funny things happen along the way to a final research report. It's the problem of researchers' degrees of freedom: there's a lot of opportunity for picking and choosing, and changing horses mid-race. Researchers can succumb to the temptation of over-interpreting the value of what they're analyzing, with "convincing self-justification." (See the moving goalposts over time here, for example, as trialists are faced with results that didn't quite match their original expectations.)

And even if the researchers don't read too much into their own data, someone else will. That interpretation can quickly turn a statistical artifact into a "fact" for many people.

Let's look more closely at Significus' pet hate: post hoc analyses. There are dangers inherent in multiple testing when you don't have solid reasons for looking for a specific association. The more often you randomly dip into data without a well-founded target, the higher your chances of pulling out a result that will later prove to be a dud.

It's a little like fishing in a pond where there are random old shoes among the fish. The more often you throw your fishing line into the water, the greater your chances of snagging a shoe instead of a fish.

Here's a study designed to show this risk. The data tossed up significant associations such as: women were more likely to have a cesarean section if they preferred butter over margarine, or blue over black ink.

The problem is huge in areas where there's a lot of data to fish around in. For published genome-wide association studies, for example, over 90% of the "associations" with a disease couldn't consistently be found again. Often, researchers don't report how many tests were run before they found their "significant" results, which makes it impossible for others to know how big a problem multiple testing might be in their work.

The problem extends to subgroup analyses where there is not an established foundation for an association. The credibility of claims made on subgroups in trials is low. And it has serious consequences. For example, an early trial suggested only men with stroke-like symptoms benefit from aspirin - which stopped many doctors from prescribing aspirin to women.

How should you interpret post hoc and subgroup analyses then? If analyses were not pre-specified and based on established, plausible reasons for an association, then one study isn't enough to be sure.

With subgroups that weren't randomized as different arms of a trial, it's not enough that the average for one subgroup is higher than the average for another subgroup. There could be other factors influencing the outcome other than their membership of that subgroup. An interaction test is done to try to account for that.

It's more complicated when it's a meta-analysis, because there are so many differences between one study and another. The exception here is an individual patient data meta-analysis, which can study differences between patients directly.

In the end, it comes down to being careful not to see a new hypothesis generated by research as a "fact" already proven by the study from which it came.

Post hoc, ergo propter hoc. This description of basic faulty logic - "after this, therefore because of this" - is as ancient as the language that made it famous. We've had millennia to snap out of the dangerous mental shortcut of seeing a cause where there's only coincidence. Yet we still hurtle like lemmings over cliffs into its alluring clutches.

More on multiple testing at Statistically Funny.

What's so good about "early," anyway?

2013-12-29T16:05:00.001-05:00

"Early." It's one of those words like "new" and "fast," isn't it? As though they are inherently good, and their opposites - "late," "old" and "slow" - are somehow bad.

Believing in the value and virtue of being an early bird has deep roots in our cultural consciousness. It goes back at least as far as ancient Athens. Aristotle's treatise on household economics said that early rising was both virtuous and beneficial: "It is likewise well to rise before daybreak; for this contributes to health, wealth and wisdom."

But just as Gertrud came to suspect the benefits for her of being early weren't all they were cracked up to be, earliness isn't always better in other areas either. The "get in early!" assumption has an in-built tendency to lead us astray when it comes to detection of diseases and conditions. And even most physicians - just the people we often rely on to inform us - don't understand enough about the pitfalls that lead us to jump to conclusions about early detection too, well…early.

Pitfall number 1: Those who need it least get the most early detection

This one is a double-edged sword. Firstly, whether it's a screening program or research studying early detection, there tends to be a "worried well" or "healthy volunteer" effect (selection bias). It's easy to have higher than average rates of good health outcomes in people who are at low risk of bad ones anyway. This can lead to inflated perceptions of how much benefit is possible.

The other problem is an over-supply of fatalism among many people who may be able to materially benefit from early detection. Constant bombardment about all the things they could possibly be worrying about might even make it more likely that they shut out vital information - which could make it even more likely that they ignore symptoms, for example.

Pitfall number 2: Over-diagnosis from detecting people who would never have become ill from the condition detected

This one is called length bias. For many conditions, like cancers, there are dangerous ones that develop too quickly for a screening program to catch them. Early detection is actually better at detecting ones that may never threaten their health. More people die with cancer, than of it.

So early detection means many people are fighting heroic battles that were never necessary. And some will actually be harmed by parts of some screening processes that carry serious risks of their own (like colonoscopies), or adverse effects of the treatments they got which they didn't need.

Add to those the number of people who are diagnosed as being "at risk" of conditions they will never have or which would have resolved without treatment, and the number harmed is depressingly huge.

This massive swelling of the numbers of people who have survived phantoms is spreading the shadow of angst ever wider (a subject I've written about in relation to cancer at Scientific American). Spend 10 minutes or so listening to Iona Heath on this subject - starting just past 2 minutes on this video. [And read @Deevybee's important comment and links about developmental conditions in early childhood below.]

Pitfall number 3: The statistical effect that means survival rates "improve" even if no one's life expectancy increases

This is lead-time bias. And it's why you should always be careful when you see survival rates in connection with early detection and treatment. Screening programs, by definition, are for people who have no symptoms (pre-clinical). So they cut the part of your life where you don't know you have the disease short. Even if the earlier diagnosis made no difference to the length of your life, the amount of time you lived with knowledge of the disease (disease "survival") is longer.

What we want is to move the needle on length and/or quality of life. For that to happen, there has to be safe and effective treatment, safe and effective screening procedures, and more people found at a time they can be helped than would have come from diagnosing the condition when there symptoms.

Here's an example. This person's disease began when they were 40 years old. They lived without any problem from it until 76 years old - then they died when they were 80. Their disease survival was less than 5 years. The proportion of their life that they "had" the disease was short.

Now here's the same person, with early detection that made no difference to when they died - but the needle on how long they have "had" the disease has shifted. So they now survive longer than 5 years with the disease. The "lead time" has changed, but survival in the way we mean it hasn't changed at all.

Randomized trials are needed to establish that in fact early detection and intervention programs do more good than harm - some do, some don't.

More Statistically Funny on screening - "You have the right to remain anxious" and on over-diagnosis: here and here.

Here's a fact sheet about what you need to know about screening tests. And here's a little more technical primer of the 3 biases explained here.

Or my post at PLOS Blogs, The Disease Prevention Illusion: A Tragedy in Five Parts.

Does it work? Beware of the too-simple answer

2013-11-17T13:55:00.001-05:00

Leonard is so lucky! He's just asked a very complicated question and he's not getting an over-confident and misleading answer. Granted, he was likely hoping for an easier one! But let's dive into it.

"Does": that verb packs a punch. How do we know whether something does or doesn't work? It would be great if that were simple, but unfortunately it's not.

I talk a lot here at Statistically Funny about the need for trials and systematic reviews of them to help us find the answers to these questions. But whether we're talking about trials or other forms of research, statistical techniques are needed to help make sense of what emerges from a study.

Too often, this aspect of research is going to lead us down a garden path. It's common for people to take the approach of relying only, or largely, on testing for statistical significance. People often assume this means it's categorical proof of whether or not something was a coincidence.

However a statistically significant result - especially from a single study - is often misunderstood and contributes to over-confidence about what we know. It's not a magical wand that finds out the truth. The numbers alone cannot possibly "know" that. Here's my quick overview: 5 tips to avoid getting this wrong. I wrote about testing for statistical significance in some detail at Absolutely Maybe. Leonard's statistician is a Bayesian: you can find out some more about that, too, in that post.

As chance would have it, there was also a lot of discussion this week in response to a paper published while I was writing that post. It called for a tightening of the threshold for significance, which isn't really the answer either. Thomas Lumley puts that into great perspective over at his wonderful blog, Biased and Inefficient: a very valuable read.

"It": now that part of our question should be easy, right? Actually, this can be particularly tricky. The treatment you could be using may not be very much like the one that was studied. Even if it's a prescription drug, the dose or regimen you're facing might not be the same as the one used in studies. Or it might be used with another intervention that could affect how it works.

Then there's the question of whether "it" is even what it says it is. Unlike prescription drugs, for example, the contents of herbal remedies and dietary supplements aren't closely regulated to ensure that what it says on the label is what's inside. That was also recently in the news, and covered in detail here by Emily Willingham.

If it's a non-drug intervention, it's actually highly likely that the articles and other reports of the research don't ever make clear exactly what "it" is. Paul Glasziou had a brainwave about this: he's started HANDI: the Handbook of Non-Drug Intervention. When a systematic reviews shows that something works, the HANDI team wants to dig out all the details and make sure we all know exactly what "it" is.

For example, if you heard that drinking water before meals can help you lose weight, and you want to try it, HANDI helpfully points out what that actually means is drinking half a liter of water before every meal AND having a low-calorie diet.

"Work": this one really needs to get specific. You really need to be thinking about each possible outcome separately - the evidence is going to vary in quality and quantity from one outcome to another. I explain this in another post here.

Think of it this way: if you do a survey with 150 questions in it, there are going to be more answers to some of the questions than others. For example, if you had 400 survey respondents, they might all have answered the first easy question and there could be virtually no answers to a hard question near the end. So thinking "a survey of 400 people found…" an answer to that later question is going to be seriously misleading.

On top of that, you have to take the possible adverse effects into account, too. There can be complicated trade-offs between effects. And how much does it work for a particular outcome? Does a sliver of a benefit count to you as "working"? That might be enough for the person answering your question, but it might not be enough for it to count for you - especially if there are risks, costs or inconvenience involved.

And who did it work for in the research? Whether or not research results apply to a person in your situation can be straightforward, but it might not be.

Then there's the question of how high did researchers set the bar? Did the treatment effect have to be superior to doing nothing, or doing something else - or is the information coming from comparing it to something else that itself may not be all that effective? You might think that can't possibly happen, but it does more often than you might think. You can find out about this here at Statistically Funny, where I tackle the issue of drugs that are "no worse (more or less)."

Finally, one of the most common trip-ups of all: did they really measure the outcome, or a proxy for it? If it's a proxy for the real thing, how good a proxy is it? The use of surrogate measures or biomarkers is increasing fast: more about that in this post.

So while there are many who might have told Leonard, "Yes, it's been proven to work in clinical trials" in a few seconds flat, I wonder how long it would take his statistician to answer the question? There are no stupid questions, but beware of the too-simple answer.

More than one kind of self-control

2013-09-16T03:22:00.003-04:00

If you like reading randomized trials about skin and oral health treatments - and who doesn't? - you come across a few split-face and split-mouth ones. Instead of randomizing groups of people to different interventions so that a group of people can be a control group (parallel trials), sections of a person are randomized.

It's not only done with faces and teeth. Pairs of body parts can be randomized too, like arms or legs. These studies are sometimes called "within-person" trials. This kind of randomization means that you need fewer people in the trial, because you don't have to account for all the variations between human beings.

It has to be a treatment that affects only the specific area of the body treated, though. Anything that could have an influence on the "control" part is called a spill-over effect. There are still inevitably things that happen that affect the whole person, and those have to be accounted for with this kind of trial. Body part randomization is one of several ways a person can be their own control: the n of 1 trial is another way.

Randomizing sections didn't start in trials with people: it began with split-plot experiments in agricultural research. The idea was developed by the pioneer statistician, Sir Ronald Aylmer Fisher, who had done breeding experiments. He explained the technique in his classic 1925 text, "Statistical Methods for Research Workers."

It's great to see that neither blackheads nor treatment effects are hampering the Twilling sisters' style! They do seem to be at risk of susceptibility to the skincare industry's hard sells, though. Those issues are the subject of my post Blemish: The Truth About Blackheads.

Alleged effects include howling

2013-07-28T13:55:00.003-04:00

When dogs howl at night, it's not the full moon that sets them off. Dogs are communicating for all sorts of reasons. We're just not all that good at understanding what they're saying.

We make so many mistakes about attributing cause and effect for so many reasons, that it's almost surprising we get it right as often as we do. But all those mistaken beliefs we realize we have, don't seem to teach us a lesson. Pretty soon after catching ourselves out, we're at it again, taking mental shortcuts, being cognitive misers.

It's so pervasive, you would think we would know this about ourselves, at least, even if we don't understand dogs. Yet we commonly under-estimate how much bias is affecting our beliefs. That's been dubbed the bias blind spot that we tend to live in.

Even taking all that into account, "effect" is an astonishingly over-used word, especially in research and science communication where you would hope people would be more careful. The maxim that correlation (happening at the same time) does not necessarily mean causation has spread far and wide, becoming something of a cliche along the way.

But does that mean that people are as careful with the use of the word "effect" as they are with the use of the "cause" word? Unfortunately not.

Take this common one: "Side effects include...." Well, actually, don't be so fast to swallow that one. Sometimes, genuine adverse effects will follow that phrase. But more often, the catalogue that follows is not adverse effects, but a list of adverse events - things that happened (or were reported). Some of them may be causally related, some might not be.

You have to look carefully at claims of benefits and harms. Even researchers who aren't particularly biased will word it carelessly. You will often hear that 14% experienced nausea, say - without it being pointed out that 13% of people on placebos also experienced nausea, and the difference wasn't statistically significant. Some adverse effects are well known, and it doesn't matter (diarrhea and antibiotics, say). That's not always so, though. (More on this: 5 Key Things to Know About Adverse Effects.)

If the word "effect" is over-used, the word "hypothesis" is under-used. Although generating hypotheses is a critical part of science, hypotheses aren't really marketed as what they are: ideas in need of testing. Often the language is that of attribution throughout, with a little fig-leaf of a sentence tacked on about the need for confirmatory studies. In fact, we cannot take replication and confirmation for granted at all.

Goldilocks and the three reviews

2013-06-30T22:44:00.004-04:00

Goldilocks is right: that review is FAR too complicated. The methods section alone is 652 pages long! Which wouldn't be too bad, if it weren't that it is a few years out of date. It took so long to do this review and go through rigorous enough quality review, it was already out of date the day it was released. Something that happens often enough to be rather disheartening.

When methodology for systematic reviewing gets overly rococo, the point of diminishing returns will be passed. That's a worry, for a few reasons. For one, it's inefficient and more reviews could be done with the resources. Secondly, more complex methodology can both be daunting, and it can be hard for researchers to accomplish with consistency. Thirdly, when a review gets very elaborate, reproducing or updating it isn't going to be easy either.

It's unavoidable for some reviews to be massive and complex undertakings, though, if they're going to get to the bottom of massive and complex questions. Goldilocks is right about review number 2, as well: that one is WAY too simple. And that's a serious problem, too.

Reviewing evidence needs to be a well-conducted research exercise. A great way to find out more about what goes wrong when it's not, is reading Testing Treatments. And see more on this here at Statistically Funny, too.

You need to check the methods section of every review before you take its conclusions seriously - even when it claims to be "evidence-based" or systematic. People can take far too many shortcuts. Fortunately, it's not often that a review gets as bad as the second one Goldilocks encountered here. The authors of that review decided to include only one trial for each drug "in order to keep the tables and figures to a manageable size." Gulp!

Getting to a good answer also quite simply takes some time and thought. Making real sense of evidence and the complexities of health, illness and disability is often just not suited to a "fast food" approach. As the scientists behind the Slow Science Manifesto point out, science needs time for thinking and digesting.

To cover more ground, people are looking for reasonable ways to cut corners, though. There are many kinds of rapid review, including reliance on previous systematic reviews for new reviews. These can be, but aren't always, rigorous enough for us to be confident about their conclusions.

You can see this process at work in the set of reviews discussed at Statistically Funny a few cartoons ago. Review number 3 there is in part based on review number 2 - without re-analysis. And then review number 4 is based on review number 3.

So if one review gets it wrong, other work may be built on weak foundations. Li and Dickersin suggest this might be a clue to the perpetuation of incorrect techniques in meta-analyses: reviewers who got it wrong in their review, were citing other reviews that had gotten it wrong, too. (That statistical technique, by the way, has its own cartoon.)

Luckily for Goldilocks, the bears had found a third review. It had sound methodology you can trust. It had been totally transparent from the start - included in PROSPERO, the international prospective register for systematic reviews. Goldilocks can get at the fully open review and its data are in the Systematic Review Data Repository, open to others to check and re-use. Ahhh - just right!

PS:

I'm grateful to the Wikipedians who put together the article on Goldilocks and the three bears. That article pointed me to the fascinating discussion of "the rule of three" and the hold this number has on our imaginations.

Studies of cave paintings have shown....

2013-06-23T12:51:00.001-04:00

The mammoth has a good point. Ogg's father is making a classic error of logic. Not having found proof that something really happens, is not the same as having definitive proof that this thing cannot possibly happen.

Ogg's family doesn't have the benefit of Aristotle's explanation of deductive reasoning. But even two thousand years after Aristotle got started, we still often fall into this trap.

In evidence-based medicine, a part of this problem is touched on by the saying, "absence of evidence is not evidence of absence." A study says "there's no evidence" of a positive effect, and people jump to the conclusion - "it doesn't work." Baby Ogg gets thrown out with the bathwater.

The same thing is happening when there are no statistically significant serious adverse effects reported, and people infer from that, "it's safe."

This situation is the opposite of the problem of reading too much into a finding of statistical significance (explained here). Only in this case, people are over-interpreting non-significance. Maybe the researchers simply didn't study enough of the right people, or they weren't looking at the outcomes that later turn out to be critical.

Researchers themselves can over-interpret negative results. Or they might phrase their conclusions carelessly. Even if they avoid the language pitfalls here, journalists could miss the nuance (or think the researchers are just being wishy-washy) and spread the wrong message. And even if everyone else phrased it carefully, the reader might jump to that conclusion anyway.

When researchers say "there is no evidence that...", they generally mean they didn't find any, or enough of, a particular type of evidence that they would find convincing. Obviously, no one can ever be sure they have even seen all the evidence. And it doesn't mean everyone would agree with their conclusion, either. To be reasonably sure of a negative, you might need quite a lot of evidence.

On the other hand, the probability of something being extremely unlikely to be real based on quite a lot of knowledge - that there's a community of giant blue swans with orange and pink polka dots on the Nile, say - increases the confidence you might have in even a small study exploring that hypothesis.

In 2020 during the Covid-19 pandemic, we found out how deeply another problem goes: taking the absence of particular types of evidence as the rationale for not taking public health action. Early in April I wrote in WIRED about how this was leading us to policies that didn't make sense – especially in not recommending personal masks to help reduce community transmission. At the same time, Trisha Greenhalgh and colleagues pointed out that was ignoring the precautionary principle: it's important to avoid doing harm caused by not taking other forms of evidence seriously enough. When it was finally acknowledged that the policy had to change, it was a recipe for chaos.

Which brings us to the other side of this coin. Proving that something doesn't exist to the satisfaction of people who perhaps need to believe it most earnestly, can be quite impossible. People trying to disprove the claim that vaccination causes autism, for example, are finding that despite the Enlightenment, our rational side can be vulnerable to highjacking. Voltaire hit that nail on the head in the 18th century: "The interest I have to believe a thing is no proof that such a thing exists."

~~~~

Update 3 July 2020: Covid-19 paragraph added.