Statistically Funny: 2014

Sunday, November 30, 2014

Biomarkers Unlimited: Accept Only OUR Substitutes!

Sounds great, doesn't it? Getting clinical trial results quickly has so much going for it. Information sooner! More affordable trials!

Replacing outcomes that can take years, or even decades, to emerge with ones you can measure much earlier makes clinical research much simpler. This kind of substitute outcome is called a surrogate (or intermediate) endpoint or outcome.

Surrogates are often biomarkers - biological signs of disease, or of a risk factor for disease, like cholesterol in the blood. They are used in clinical care to test for, or keep track of, signs of emerging or progressing disease. Sometimes, like cholesterol, they're the target of treatment.

The problem is, these kinds of substitute measures aren't always reliable. And sometimes we find that out in the hardest possible way.

The risk was recognized as soon as the current methodology of clinical trials was being developed in the 1950s. Austin Bradford Hill, who played a leading role, put it bluntly: if the "rate falls, the pulse is steady, and the blood pressure impeccable, we are still not much better off if unfortunately the patient dies."

That famously happened with some drugs that controlled cardiac arrhythmia - irregular heartbeat that increases the chances of having a heart attack. On the basis of ECG tests that showed the heartbeat was regular, these drugs were prescribed for years before a trial showed that they were causing tens of thousands of premature deaths, not preventing them. That kind of problem has happened too often for comfort.

It happened again this week - although at least before the drug was ever approved. A drug company canceled all its trials of a new drug, rilotumumab, for advanced gastric (stomach) cancer. Back in January, it was a "promising" treatment, billed as bringing "new hope in gastric cancer." It got through the early testing phases and was in Phase III trials - the kind needed to get FDA approval.

But one Phase III trial, RILOMET-1, quickly showed an increase in the number of deaths in people using the drug. We don't know how many yet - but it was enough for the company to decide to end all trials of the substance.

This drug targets a biomarker associated with worse disease outcomes, an area seen by some as transforming gastric cancer research and treatment. Others see considerable challenges, though - and what happened to the participants in the RILOMET-1 trial underscores why.

There is a lot of controversy about surrogate outcomes - and debates about what's needed to show that an outcome or measure is a valid surrogate we can rely on. They can lead us to think that a treatment is more effective than it really is.

Yet a recent investigative report found that cancer drugs are being increasingly approved based only on surrogate outcomes, like "progression-free survival." That measures biomarker activity rather than overall survival (when people died).

It can be hard to recognize at first what's a surrogate and what's an actual health outcome. One rule of thumb is, if you need a laboratory test of some kind, it's more likely to be a surrogate. Whereas symptoms of the disease you're concerned about, or harm caused by the disease, are the direct outcomes of interest. Sometimes those are specified as "patient-relevant outcomes."

Many surrogate outcomes are incredibly important, of course - viral load for HIV treatment and trials for example. But in general, when clinical research results are based only on surrogates, the evidence just isn't as strong and reliable as it is for the outcomes we are really concerned about.

~~~~

See also, Statistically Funny on "promising" treatments.

Sunday, October 12, 2014

Sheesh - what are those humans thinking?

I can neither confirm nor deny that Cecil is now a participant in one of the there-is-no-limit-to-the-human-lifespan resveratrol studies at Harvard's "strictly guarded mouse lab"! If he is, I'm sure he's even more baffled by the humans' hype over there.

Resveratrol is the antioxidant in grapes that many believe makes drinking red wine healthy. And it's a good example of how research on animals is often terribly misleading and misinterpreted. I've written about it over at Absolutely Maybe if you're interested in a classic example of the rise and fall of animal-research-based hype (or more detail about resveratrol).

But this week, it's media hype about a study using human stem cells in mice in another lab at Harvard that's made me ratty. You could get the idea that a human trial of a "cure" for type 1 diabetes is just a matter of time now - and not a lot of time at that. According to the leader of the team, Doug Melton, "We are now just one preclinical step away from the finish line."

An effective treatment that ends the need for insulin injections would be incredibly exciting. But we see this kind of claim from laboratory research all the time, don't we? How often does it work out - even for the studies that are at "the finishing line" for animal studies?

Not all that often: maybe about a third of the time.

Bart van der Worp and colleagues wrote an excellent paper explaining why. It's not just that other animals are so different from humans. We're far less likely to hear of the failed animal results than we are of human trials that don't work out as hoped. That bias towards positive published results draws an over-optimistic picture.
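To see how a bias towards positive published results can inflate the picture, here is a minimal simulation sketch in Python. Everything in it is an assumption made up for illustration, not taken from van der Worp's paper: the size of the true effect, the amount of noise in each study, and the crude rule that only studies with a "significantly" positive estimate get published.

```python
import random
import statistics


def simulate_published_effects(true_effect=0.2, se=0.5, studies=5000, seed=7):
    """Simulate many small studies of the same treatment.

    Assumptions (hypothetical): each study's estimate is the true effect
    plus normally distributed noise with standard error `se`, and a study
    is only "published" if its estimate is more than 1.96 standard errors
    above zero.
    """
    rng = random.Random(seed)
    all_estimates = [rng.gauss(true_effect, se) for _ in range(studies)]
    published = [e for e in all_estimates if e / se > 1.96]
    return (
        statistics.mean(all_estimates),  # what we'd see with every study
        statistics.mean(published),      # what the published record shows
        len(published),
    )


mean_all, mean_published, n_published = simulate_published_effects()
print(f"All studies:       mean effect {mean_all:.2f}")
print(f"Published studies: mean effect {mean_published:.2f} (n={n_published})")
```

Under these assumptions, the average across all the studies sits close to the true effect, while the average of the "published" ones has to be well above it, simply because of which studies made it through the filter.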

As well as fundamental differences between species, van der Worp points to other common issues that reduce the applicability for humans of typical studies in other animals:

The animals tend to be younger and healthier than the humans who have the health problem;
They tend to be a small group of animals that are very similar to each other, while the humans with the problem are a large, very varied group;
Only male or only female animals are often used; and
Doses higher than humans will be able to tolerate are generally used.

Limited genetic diversity could be an issue, too.

So how does the Harvard study fare on that score? They used stem cells to develop insulin-producing cells that appeared to function normally when transplanted into mice. But these were the very early stages. In the test they reported in mice with diabetes, only 6 (young) mice got the transplants (and 1 of them died), plus a comparison group. Gender was not reported - and as is common in laboratory animal studies, there wasn't lengthy follow-up. This was an important milestone, but there's a very long way to go here. Transplants in humans face a lot of obstacles.

Van der Worp points to another set of problems: inadequacies in research methods that, as we've learned over time in human research, bias the proceedings too much - including problems with statistical analyses. Jennifer Hirst and colleagues have studied this too. They concluded that so many studies were bedeviled by issues such as lack of randomization, and lack of blinding of those assessing outcomes, that they should never have been regarded as being "the finishing line" before human experimentation at all.

There's good news though! CAMARADES is working to improve this - with the same approach for chipping away at these problems as in human trials: by slogging away at biased methodologies and publication bias. And pushing for good quality systematic reviews of animal studies before human trials are undertaken. It's well worth half an hour to watch the wonderful talk by Emily Sena at Evidence Live 2015.

Laboratory animal research may be called "preclinical," but even that jargon is a bit of over-optimistic marketing. Most of what's tried in the lab will never get near human trials. And when it does, it will mostly be disappointing. Laboratory research is needed, and encouraging progress is great. But we should definitely not be getting our hopes up too much about it.

~~~~

The National Institutes of Health (NIH) addressed the issue of gender in animal experiments earlier in 2014. After I wrote this post, the NIH also released proposed guidelines for reporting preclinical research.

Thanks to Jonathan Eisen for adding a link for the full text of the paper to PubMed Commons, as well as to a blog post by Paul Knoepfler discussing the context of the stem cell work by Felicia Pagliuca, Doug Melton and colleagues. NHS Behind the Headlines have also analyzed and explained this study.

Thanks to Jim Johnson for pointing out an oversight: that animal studies - this one included - can also suffer from having too little follow-up.

Interest declaration: I'm an academic editor at one of the journals whose papers on animal research I commended (PLOS Medicine) and on the human ethics advisory group of another (PLOS One), but I had no involvement in either paper.

Update: Checked, post and cartoon refreshed, and link to Sena's talk at Evidence Live added on 5 December 2015.

Sunday, March 16, 2014

If at first you don't succeed...

If only post hoc analyses always brought out the inner skeptic in us all! Or came with red flashing lights instead of just a little token "caution" sentence buried somewhere.

Post hoc analysis is when researchers go looking for patterns in data. (Post hoc is Latin for "after this.") Testing for statistically significant associations is not by itself a way to sort out the true from the false. (More about that here.) Still, many treat it as though it is - especially when they haven't been able to find a "significant" association, and turn to the bathwater to look for unexpected babies.

Even when researchers know the scientific rules and limitations, funny things happen along the way to a final research report. It's the problem of researchers' degrees of freedom: there's a lot of opportunity for picking and choosing, and changing horses mid-race. Researchers can succumb to the temptation of over-interpreting the value of what they're analyzing, with "convincing self-justification." (See the moving goalposts over time here, for example, as trialists are faced with results that didn't quite match their original expectations.)

And even if the researchers don't read too much into their own data, someone else will. That interpretation can quickly turn a statistical artifact into a "fact" for many people.

Let's look more closely at Significus' pet hate: post hoc analyses. There are dangers inherent in multiple testing when you don't have solid reasons for looking for a specific association. The more often you randomly dip into data without a well-founded target, the higher your chances of pulling out a result that will later prove to be a dud.

It's a little like fishing in a pond where there are random old shoes among the fish. The more often you throw your fishing line into the water, the greater your chances of snagging a shoe instead of a fish.
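To put rough numbers on the fishing analogy, here is a minimal sketch in Python. It assumes there are no real associations in the data at all, that every test is independent of the others, and that "significant" means the conventional p < 0.05. The function names are made up for this illustration.

```python
import random


def chance_of_false_positive(num_tests, alpha=0.05):
    """Chance of at least one 'significant' result among independent tests
    when there is nothing real to find: 1 - (1 - alpha) ** num_tests."""
    return 1 - (1 - alpha) ** num_tests


def simulate_fishing(num_tests, alpha=0.05, trips=10_000, seed=1):
    """Simulate many fishing trips, each casting the line `num_tests` times.

    When there is no real association, a p-value is equally likely to land
    anywhere between 0 and 1, so each cast snags a 'shoe' (a false positive)
    with probability alpha.
    """
    rng = random.Random(seed)
    trips_with_shoe = sum(
        any(rng.random() < alpha for _ in range(num_tests))
        for _ in range(trips)
    )
    return trips_with_shoe / trips


for n in (1, 5, 20, 100):
    print(n, round(chance_of_false_positive(n), 3), round(simulate_fishing(n), 3))
```

Under those assumptions, a single test has a 5% chance of throwing up a dud, but by 20 tests the chance of at least one is about 64%. Real analyses are rarely fully independent, so treat this as a rough guide rather than an exact figure.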

Here's a study designed to show this risk. The data tossed up significant associations such as: women were more likely to have a cesarean section if they preferred butter over margarine, or blue over black ink.

The problem is huge in areas where there's a lot of data to fish around in. For published genome-wide association studies, for example, over 90% of the "associations" with a disease couldn't consistently be found again. Often, researchers don't report how many tests were run before they found their "significant" results, which makes it impossible for others to know how big a problem multiple testing might be in their work.

The problem extends to subgroup analyses where there is not an established foundation for an association. The credibility of claims made on subgroups in trials is low. And it has serious consequences. For example, an early trial suggested only men with stroke-like symptoms benefit from aspirin - which stopped many doctors from prescribing aspirin to women.

How should you interpret post hoc and subgroup analyses then? If analyses were not pre-specified and based on established, plausible reasons for an association, then one study isn't enough to be sure.

With subgroups that weren't randomized as different arms of a trial, it's not enough that the average for one subgroup is higher than the average for another subgroup. Factors other than membership of that subgroup could be influencing the outcome. An interaction test is done to try to account for that.
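As a rough illustration, one simple form of interaction test compares the two subgroup estimates directly, instead of judging each subgroup on its own. The Python sketch below assumes two independent subgroups whose effect estimates are approximately normally distributed. The function name and the numbers are invented for the example, and real analyses often use more elaborate models.

```python
import math


def interaction_z_test(effect_a, se_a, effect_b, se_b):
    """Test whether two independent subgroup effect estimates differ by
    more than chance would explain.

    Returns the z statistic for the difference and its two-sided p-value
    (assumption: both estimates are approximately normal).
    """
    difference = effect_a - effect_b
    se_difference = math.sqrt(se_a ** 2 + se_b ** 2)
    z = difference / se_difference
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p_value


# Hypothetical subgroup results: the effect looks "significant" in men
# (-0.40, SE 0.15) but not in women (-0.10, SE 0.20).
z, p = interaction_z_test(-0.40, 0.15, -0.10, 0.20)
print(f"z = {z:.2f}, p = {p:.2f}")
```

With these made-up numbers, the result in men would pass a significance test on its own and the result in women would not, yet the test of the difference between them gives p of about 0.23. That is nowhere near enough to claim the treatment works differently in the two groups.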

It's more complicated when it's a meta-analysis, because there are so many differences between one study and another. The exception here is an individual patient data meta-analysis, which can study differences between patients directly.

In the end, it comes down to being careful not to see a new hypothesis generated by research as a "fact" already proven by the study from which it came.

Post hoc, ergo propter hoc. This description of basic faulty logic - "after this, therefore because of this" - is as ancient as the language that made it famous. We've had millennia to snap out of the dangerous mental shortcut of seeing a cause where there's only coincidence. Yet we still hurtle like lemmings over cliffs into its alluring clutches.

More on multiple testing at Statistically Funny.