Statistically Funny: 2016

Sunday, September 11, 2016

The Highs and Lows of the "Good Study"

Reporting on a study as though it's a weather report. Scientist at a desk has a map behind them, showing "sunny" for education, but cloudy for data on under-18s. (Cartoon by Hilda Bastian.)

Imagine if weather reports only gave the expected average temperature across a whole country. You wouldn't want to be counting on that information when you were packing for a trip to Alaska or Hawaii, would you?

Yet that's what reports about the strength of scientific results typically do. They will give you some indication of how "good" the whole study is, and leave you with the misleading impression that the "goodness" applies to every result.

Of course, there are some quality criteria that apply to the whole of a study, and affect everything in it. Say I send out a survey to 100 people and only 20 people fill it in. That low response rate affects the study as a whole.

You can't just think about the quality of a study, though. You have to think about the quality of each result within that study. The likelihood is, the reliability of data will vary a lot.

For example, that imaginary survey could find that 25% of people said yes, they ate ice cream every week last month. That's going to be more reliable data than the answer to a question about how many times a week they ate ice cream 10 years ago. And it's likely to be less reliable than their answers to the question, "What year were you born?"

Then there's the question of missing data. Recently I wrote about bias in studies on the careers of women and men in science. A major data set people often analyze is a survey of people awarded PhDs in the United States. Around 90% of people answer it.

But within that, the rate of missing data for marital status can be around 10%, while questions on children can go unanswered 4 times as often. Conclusions based on what proportion of people with PhDs in physics are professors will be more reliable than conclusions on how many people with both PhDs in physics and school-age children are professors.

One of the most misleading areas of all for this is the abstracts and news reports of meta-analyses and systematic reviews. It will often sound really impressive: they'll tell you how many studies, and maybe how many people are in them, too. You could then get the impression that all the results they tell you about have that weight behind them. The standard-setting group behind systematic review reporting says you shouldn't do that: you should make it clear with each result. (Disclosure: I was part of that group.)

This is a really big deal. It's unusual for every single study to ask exactly the same questions, and gather exactly the same data, in exactly the same way. And of course that's what you need to be able to pool their answers into a single result. So the results of meta-analyses very often draw on a subset of the studies. It might be a big subset, but it might be tiny.
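Here's a rough sketch of that in Python. Everything in it is made up for illustration: the five studies, their numbers, and the choice of simple inverse-variance (fixed-effect) pooling, which is just one common way to combine results. The point is only that a study with no data on a question can't contribute to the pooled answer.

```python
import math

# Hypothetical studies: (log risk ratio, standard error) for one outcome,
# or None when the study did not report that outcome at all.
studies = {
    "Study A": (0.30, 0.10),
    "Study B": None,
    "Study C": (0.45, 0.20),
    "Study D": None,
    "Study E": None,
}


def pool_fixed_effect(results):
    """Inverse-variance (fixed-effect) pooled estimate and its standard error."""
    weights = [1 / se ** 2 for _, se in results]
    total_weight = sum(weights)
    pooled = sum(w * est for w, (est, _) in zip(weights, results)) / total_weight
    return pooled, math.sqrt(1 / total_weight)


# Only studies that actually measured this outcome can be pooled.
reported = [result for result in studies.values() if result is not None]
pooled, pooled_se = pool_fixed_effect(reported)

print(f"{len(reported)} of {len(studies)} studies contribute to this result")
print(f"Pooled risk ratio: {math.exp(pooled):.2f}")
```

A review like this could truthfully be described as "a meta-analysis of 5 studies", even though this particular result rests on 2 of them.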

To show you the problem, I did a search this morning at the New York Times for "meta-analysis". I picked the first example of a journalist reporting on specific results of a meta-analysis of health studies. It's this one: about whether being overweight or obese affects your chances of surviving breast cancer. Here's what the journalist, Roni Caryn Rabin, wrote - and it's very typical:

"Just two years ago, a meta-analysis crunched the numbers from more than 80 studies involving more than 200,000 women with breast cancer, and reported that women who were obese when diagnosed had a 41 percent greater risk of death, while women who were overweight but whose body mass index was under 30 had a 7 percent greater risk".

There really was not much of a chance that all the studies had data on that - even though you would be forgiven for thinking that when you looked at the abstract. And sure enough, this is how it works out when you dig in:

There were 82 studies and the authors ran 31 basic meta-analyses;
The meta-analytic result with the most studies in it included 24 out of the 82;
84% of those results combined 20 or fewer studies - and 58% had 10 or fewer. Sometimes only 1 or 2 studies had data on a question;
The 2 results the New York Times reported came from about 25% of the studies and less than 20% of the women with breast cancer.
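If you want to turn those percentages back into counts, here's a quick bit of arithmetic in Python. It uses only the figures in the list above; the rounding to whole numbers of results is my assumption.

```python
# Figures quoted in the list above.
total_studies = 82
total_meta_analyses = 31
largest_result_studies = 24

# Share of all the studies behind the best-supported pooled result.
largest_share = largest_result_studies / total_studies

# Convert the quoted percentages back into counts of pooled results
# (rounded to whole results, which is an assumption).
results_with_20_or_fewer = round(0.84 * total_meta_analyses)
results_with_10_or_fewer = round(0.58 * total_meta_analyses)

print(f"Largest pooled result used {largest_share:.0%} of the studies")
print(f"{results_with_20_or_fewer} of {total_meta_analyses} results combined 20 or fewer studies")
print(f"{results_with_10_or_fewer} of {total_meta_analyses} results combined 10 or fewer studies")
```

So even the best-supported result drew on fewer than a third of the studies in the review.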

The risk data given in the study's abstract and the New York Times report did not come from "more than 200,000 women with breast cancer". One came from over 42,000 women and the other from over 44,000. In this case, still a lot. Often, it doesn't work out that way, though.

So be very careful when you think, "this is a good study". That's a big trap. It's not just that all studies aren't equally reliable. The strength and quality of evidence almost always varies within a study.

Want to read more about this?

Here's an overview of the GRADE system for grading the strength of evidence about the effects of health care.

I've written more about why it's risky to judge a study by its abstract at Absolutely Maybe.

And here's my quick introduction to meta-analysis.

Sunday, August 14, 2016

Cupid's Lesser-Known Arrow

Cupid's famous arrow causes people to fall blindly in love with each other. That can end happily ever after. Not so with his lesser known "immortal time bias" arrow! That one causes researchers to fall blindly in love with profoundly flawed results - and that never ends well.

This type of time-dependent bias often afflicts observational studies. It's a particular curse for those studies relying on the "big data" from medical records. A recent study found close to 40% of susceptible studies in prominent medical journals were "biased upward by 10% or more". A study in 2011 found that 62% of studies of postoperative radiotherapy didn't safeguard against immortal time bias. That could make treatment look more effective than it really is.

So what is it? It's a stretch of time where an outcome couldn't possibly occur for one group - and that gives them a head start over another group. Samy Suissa describes a classic case from the early days of heart transplantation in the 1970s. A 1971 study showed 20 people who had heart transplants at Stanford lived an average of 200 days compared to 14 transplant candidates who didn't get them and survived an average of 34 days.

Those researchers had started the clock for each of the 34 people from the point at which they were accepted into the program. Now of course, all the people who got the transplants were alive at the time of surgery. For the stretch of time they were on the waiting list, they were "immortal": you could not die and still get a heart transplant. So when people on the waiting list died early, they were in the no-transplant group.

When the data were re-analyzed by others in 1974 to factor this into account, the survival advantage of the operation disappeared. (More about the history in Hanley and Foster's article, Avoiding blunders involving 'immortal time'.)
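You can see how this plays out with a small simulation. This is a sketch with made-up numbers, not the Stanford data: the average survival and waiting times are assumptions, and the simulated transplant does nothing at all to survival. People are simply sorted into groups the way the 1971 analysis sorted them.

```python
import random
from statistics import mean

random.seed(1)

MEAN_SURVIVAL = 100.0  # days from acceptance until death (hypothetical)
MEAN_WAIT = 60.0       # days until a donor heart turns up (hypothetical)

transplanted = []
not_transplanted = []

for _ in range(10_000):
    # Survival is drawn independently of the transplant: the operation has no effect.
    survival = random.expovariate(1 / MEAN_SURVIVAL)
    wait = random.expovariate(1 / MEAN_WAIT)

    # You only get the transplant if you are still alive when a heart arrives.
    if wait < survival:
        transplanted.append(survival)
    else:
        not_transplanted.append(survival)

# Naive comparison: clock starts at acceptance, groups defined by what happened later.
print(f"Transplant group:    mean survival {mean(transplanted):.0f} days")
print(f"No-transplant group: mean survival {mean(not_transplanted):.0f} days")
```

The "transplant" group comes out living far longer on average, purely because you had to survive the wait to get into it.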

This bias is also called survivor or survival bias, or survivor treatment selection bias. But time-dependent biases can affect any outcome, not just death. So "immortal time" isn't really the best term. Hanley and Foster call it event-free time.

Carl van Walraven and colleagues are among those who call this kind of phenomenon "competing risk bias":

Competing risks are events whose occurrence precludes the outcome of interest.

They are the authors of the 2016 study I mentioned above about how common the problem is. They show the impact on data in a study they did themselves on patient discharge summaries.

If you were re-admitted to hospital before you got to a physician visit with your discharge summary, you didn't fare as well as the people who went to the doctor, and you never got the chance to become one of them. So if you just compare the group who went to the physician for follow-up, as the hospital encouraged, with the group who didn't, the group who didn't visit their doctor had way higher re-admission rates. Not much surprise there, eh?

Van Walraven and colleagues say the risk grew as people started to do more time-to-event studies. They put the problem down partly to the popularity of a method for estimating survival that doesn't recognize these risks in its basic analyses. That's Kaplan-Meier risk estimation. You see Kaplan-Meier curves referred to a lot in medical journals.

Although they're called curves, I think they look more like staircases. Here's an example: number of months survived here starts off the same, but gets better for the blue line after a year, plateauing a couple of years later.
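The staircase shape comes straight from how the estimate is calculated. Here's a minimal sketch of the basic Kaplan-Meier calculation in Python, with made-up follow-up times: the estimate only drops at the moments an event happens, and stays flat in between. Note that this basic version only knows "event" or "censored". It has no way of recognizing a competing risk, or someone changing groups over time.

```python
def kaplan_meier(times, events):
    """Return [(time, survival estimate)] for each time at which an event occurs.

    times  -- follow-up time for each person
    events -- True if the outcome happened at that time, False if censored
    """
    at_risk = len(times)
    survival = 1.0
    steps = []
    for t in sorted(set(times)):
        events_at_t = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving_at_t = sum(1 for ti in times if ti == t)
        if events_at_t:
            # Each event knocks the estimate down by the share of people still at risk.
            survival *= 1 - events_at_t / at_risk
            steps.append((t, survival))
        # Everyone with this follow-up time (event or censored) leaves the risk set.
        at_risk -= leaving_at_t
    return steps


# Hypothetical follow-up in months; False means the person was censored.
times = [2, 3, 3, 5, 8, 8, 12, 12]
events = [True, True, False, True, False, True, False, False]

for month, estimate in kaplan_meier(times, events):
    print(f"month {month:>2}: estimated survival {estimate:.3f}")
```

Plot those points with flat lines between them and you get the steps.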

Some common statistical programs don't have a way to deal with time-dependent calculations in Kaplan-Meier analyses, according to van Walraven. You need extensions of the programs to handle some data properly. The Royal Statistical Society points to this problem too, in the description for their 2-day course on Survival Analysis. (One's coming up in London in September 2016.)

Hanley and Foster have a great guide to recognizing immortal time bias (Table 1, page 956). The key, they say, is to "Think person-time, not person":

If authors used the term 'group', ask... When and how did persons enter a 'group'? Does being in or moving to a group have a time-related requirement?
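Here's a rough sketch of what "person-time" means in practice, using a made-up waiting list where the treatment has no effect at all (the survival and waiting times are assumptions for illustration only). Instead of sorting people into groups, it sorts each person's days: days on the waiting list count as untreated time, and only the days after treatment count as treated time.

```python
import random

random.seed(2)

MEAN_SURVIVAL = 100.0  # days from acceptance until death (hypothetical)
MEAN_WAIT = 60.0       # days until treatment becomes available (hypothetical)

deaths = {"waiting": 0, "after treatment": 0}
person_days = {"waiting": 0.0, "after treatment": 0.0}

for _ in range(20_000):
    # Survival is independent of treatment: the treatment does nothing.
    survival = random.expovariate(1 / MEAN_SURVIVAL)
    wait = random.expovariate(1 / MEAN_WAIT)

    if wait < survival:
        # Treated: the waiting days still belong to the untreated tally.
        person_days["waiting"] += wait
        person_days["after treatment"] += survival - wait
        deaths["after treatment"] += 1
    else:
        # Died while waiting: all of this person's days are untreated days.
        person_days["waiting"] += survival
        deaths["waiting"] += 1

rates = {state: deaths[state] / person_days[state] for state in deaths}

for state, rate in rates.items():
    print(f"{state}: {rate * 1000:.1f} deaths per 1,000 person-days")
```

Counted this way, the two death rates come out close to each other, which is what you'd expect from a treatment that does nothing.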

Given the problem is so common, we have to be very careful when we read observational studies with time-to-event outcomes and survival analyses. If authors talk about cumulative risk analyses and accounting for time-dependent measures, that's reassuring.

But what we really need is for the people who do these studies - and all the information gatekeepers, from peer reviewers to journalists - to learn how to dodge this arrow.

More reading on a somewhat lighter note: my post at Absolutely Maybe on whether winning awards or elections affects longevity.

~~~~

The Kaplan-Meier "curve" image was chosen without consideration of its data or the article in which it appears. I used the National Library of Medicine's Open i images database, and erased explanatory details to focus only on the "curve". The source is an article by Kadera BE et al (2013) in PLOS One.