How Much Should We Trust Clinical Trials?

Suppose you ask several experts how to choose a good car. Their answers reveal they don’t know how to drive. What should you conclude? Suppose these experts build cars. Should we trust the cars they’ve built?

Gina Kolata writes that “experts agree that there are three basic principles that underlie the search for medical truth and the use of clinical trials to obtain it.” Kolata’s “three basic principles” reveal that her experts don’t understand experimentation.

Principle 1. “It is important to compare like with like. The groups you are comparing must be the same except for one factor — the one you are studying. For example, you should compare beta carotene users with people who are exactly like the beta carotene users except that they don’t take the supplement.” An expert told her this. But careful equation of two groups is not how experiments are done. What is done is random assignment, which roughly (but not perfectly) equates the groups on pre-experimental characteristics. A subtler point is that an X-versus-no-X design is worse than a design that compares different dosages of X: the dosage design makes it less likely that control subjects will get upset because they didn’t get X, and it makes the two groups more alike.
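The difference between deliberate matching and random assignment can be sketched in a few lines. This is a toy illustration with made-up subjects and numbers, not a model of any real trial:

```python
# Sketch: random assignment, not deliberate matching, is what
# roughly (but not perfectly) equates groups on pre-experimental
# characteristics. Hypothetical example: 200 subjects of varying age.
import random
import statistics

random.seed(1)
ages = [random.gauss(50, 10) for _ in range(200)]

# Random assignment: shuffle, then split down the middle.
random.shuffle(ages)
treatment, control = ages[:100], ages[100:]

# The group means come out close, but not identical --
# randomization balances only approximately.
print(round(statistics.mean(treatment), 1))
print(round(statistics.mean(control), 1))
```

The same logic balances every pre-experimental characteristic at once, measured or not, which is why no amount of deliberate matching can substitute for it.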

Principle 2. “The bigger the group studied, the more reliable the conclusions.” Again, this is not what happens. No one with statistical understanding judges the reliability of an effect by the size of the experiment; they judge it by the p-value (which takes account of sample size). The subtler point is that the smaller the sample size, the stronger the effect must be to get reliable results. Researchers want to conserve resources, so they keep experiments as small as possible. Small experiments with reliable results are more impressive than large experiments with equally reliable results, because the effect must be stronger. This is basically the opposite of what Kolata says.
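The arithmetic behind this point can be sketched. Assuming unit-variance outcomes and a two-sample z-test (a simplification, not how any particular trial was analyzed), reaching the same significance threshold requires a larger effect when the sample is smaller:

```python
# Sketch: the effect size needed to reach the same p-value (the same
# "reliability") shrinks as the sample grows. With unit-variance
# groups and a two-sample z-test, significance at the 5% level
# requires roughly z = 1.96, i.e. a group difference of about
# 1.96 * sqrt(2 / n).
import math

def effect_needed(n_per_group, z=1.96):
    """Smallest group difference detectable at the given z threshold."""
    return z * math.sqrt(2 / n_per_group)

for n in (10, 100, 1000):
    print(n, round(effect_needed(n), 2))
# A significant result with n = 10 implies an effect roughly ten
# times larger than a significant result with n = 1000.
```

So equal p-values from unequal sample sizes imply unequal effects, which is the sense in which the small significant experiment is the more impressive one.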

Principle 3. In the words of Kolata’s expert, it’s “Bayes theorem”. He means: consider other evidence, evidence from other studies. This is not only banal, it is meaningless. It is unclear, at least from what Kolata writes, how to weigh the various sources of evidence (what if the other evidence and the clinical trials disagree?).
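One concrete, if simplified, reading of “use Bayes’ theorem” is precision-weighted averaging of the trial result with the prior evidence. The numbers below are invented for illustration; nothing in Kolata’s article specifies them:

```python
# Sketch of what "use Bayes' theorem" could mean in practice:
# combine prior evidence with trial data via normal-normal conjugate
# updating, weighting each source by its precision (1 / variance).
def combine(prior_mean, prior_var, data_mean, data_var):
    """Precision-weighted average of prior evidence and trial result."""
    w_prior = 1 / prior_var
    w_data = 1 / data_var
    post_mean = (w_prior * prior_mean + w_data * data_mean) / (w_prior + w_data)
    post_var = 1 / (w_prior + w_data)
    return post_mean, post_var

# Hypothetical: other studies suggest no effect (mean 0, variance 1);
# the trial estimates an effect of 2.0 with variance 1.
mean, var = combine(0.0, 1.0, 2.0, 1.0)
print(mean, var)  # -> 1.0 0.5: the posterior splits the difference
```

This at least answers the weighing question in principle: when the sources disagree, the more precise source pulls harder. Whether the experts Kolata spoke to meant anything this specific is unclear.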

Kolata also quotes David Freedman, a Berkeley professor of statistics who knew the cost of everything and the value of nothing. Perhaps it starts in medical school. As I blogged, working scientists, who have a clue, don’t want to teach medical students how to do research.

If this is the level of understanding of the people who do clinical trials, how much should we trust them? Presumably Kolata’s experts were better than average — a scary thought.

12 Replies to “How Much Should We Trust Clinical Trials?”

  1. Well I don’t know about clinical trials, but I know we shouldn’t ever trust Kolata. She had an illuminating back-and-forth with Gary Taubes at one point (you documented it) which demonstrated her inability to read and understand simple English sentences, much less basic science.

  2. It sounds like this reporter talked with the wrong experts. Not to overgeneralize, but I’ve noticed that a lot of biologists have a pretty naive understanding of statistics. It would maybe be better for her to talk with some econometricians, psychometricians, or quantitative political scientists and sociologists.

    That said, I think you’re confused on Principle 2 above. With a small sample size, you can still find statistically significant differences by chance, and they’ll look huge! Take a look at my paper with Weakliem that just appeared in American Scientist.
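That point can be checked with a quick simulation (hypothetical effect sizes and sample sizes, not numbers from the Gelman and Weakliem paper): when the true effect is tiny and the sample is small, the estimates that happen to clear the significance bar are badly exaggerated.

```python
# Simulation: with a small sample and a tiny true effect, the
# "statistically significant" estimates are the lucky extremes,
# so they overstate the true effect many times over.
import math
import random
import statistics

random.seed(42)
true_effect, n, trials = 0.1, 10, 2000
significant_estimates = []
for _ in range(trials):
    treat = [random.gauss(true_effect, 1) for _ in range(n)]
    ctrl = [random.gauss(0, 1) for _ in range(n)]
    diff = statistics.mean(treat) - statistics.mean(ctrl)
    se = math.sqrt(2 / n)          # standard error of the difference
    if abs(diff) / se > 1.96:      # "statistically significant"
        significant_estimates.append(abs(diff))

# The significant estimates average far above the true effect of 0.1.
print(round(statistics.mean(significant_estimates), 2))
```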

  3. Andrew, she wanted to understand clinical trials, so she talked to some experts who do them. What she found revealed incompetence, which is interesting. Talking to quantitative political scientists or psychometricians wouldn’t be a good way to learn about clinical trials.
    Principle 2: An experimental psychologist reading a psychology experiment with a large sample size (e.g., n = 20) would be suspicious: Why such a large sample size? It must mean the effect is weak, or maybe they first ran the experiment with n = 10 (a typical size) and didn’t find anything. Either way, it would mean something was off. In this sense, the larger the sample size, the less trustworthy the result.

  4. Seth:

    In the best of all worlds, Kolata would’ve spoken with a medical-statistics expert such as Stephen Senn, John Carlin, or Chris Schmid. But, given whom she did talk with, I’m thinking it would’ve helped for her to broaden her understanding by talking with some social science statisticians.

    Regarding principle 2: it depends on what you’re studying. If it’s a rare condition, you might need a large sample to get enough cases. Or if there’s a high level of natural variability, you’ll need a large sample to see the signal amid the noise. The example I was referring to was the sex ratio of babies, which is close to purely random. As we discuss in our paper, you need a very large sample size to discover patterns there. You might argue that a 1% change in the probability of a girl birth is so tiny that nobody should care about it, and maybe you’re right, but that’s the context of some things that people study. Medical outcomes can be highly unpredictable, and small effects can be of interest to people.

    Anyway, my main point is not to defend large studies but to disagree with the implications of your two statements, claiming (1) people with statistical understanding judge “the reliability of an effect” by the p-value, and (2) “small experiments with reliable results are more impressive than large experiments with equally reliable results.” Not so. As discussed in our American Scientist article, statistical significance doesn’t necessarily tell you much at all, if the estimate is so large as to be scientifically implausible. That’s something you can learn from statistical power analysis, or from Bayesian inference.

    Finally . . . lots of psychology studies have n>20. Just for example, my sister’s most cited article is based on a study with 104 kids. If you can find it with n=10, great. But I don’t think that experimental psychologists have been barraging Susan with questions about why her sample size is so much more than 10.

  5. Andrew, by “effect” I meant experimental effect. The whole discussion is about experiments. The sex-ratio material in your American Scientist article isn’t experimental (= does not come from experiments). I’m happy to learn about an example that contradicts what I said, but it would need to be an experiment.

    Your sister’s research isn’t experimental psychology, it’s developmental psychology. I agree, the term experimental psychology (= perceptual and cognitive psychology and animal learning) isn’t terribly clear to outsiders. Developmental psychology experiments tend to have larger n’s than experimental psychology experiments.

  6. Yes, it sounds like you use the terms “effect” and “experiment” in different ways than statisticians do. Which is fine; I realize that our usages aren’t always so intuitive.

  7. You’re being unusually uncharitable in your reading here, Seth. I don’t see anything inaccurate in her article. It’s a bit imprecise or unclear in places (for instance, she shouldn’t have said “exactly”), and it all seems pretty basic, but I don’t see this deep ignorance of research design that you’re reading into her article.

    Her first principle is that you need to eliminate confounding variables so that you can be confident that differences are due to the factor that you’re trying to study. She describes random assignment as the standard way to do this (I’m not sure why you think she doesn’t understand random assignment when she discusses it right there, explaining why randomization is better than observational studies that try to statistically control for differences).

    The second principle is saying (correctly) that larger studies give you a more precise estimate of the effect size. Studies with a smaller sample size have wider confidence intervals. A point estimate of a 20% reduction in risk may be misleading if the confidence interval runs from a 5% to a 35% reduction.

    The third principle is that other evidence can continue to be relevant after you’ve done a full study with random assignment. She seems to reach the correct conclusions about the two examples that she describes: one (prayer) where she thinks you should doubt the results of the study because of other evidence, and one (beta carotene) where she thinks that you should trust the results of the study despite the other evidence. Although you’re right that she doesn’t give much of an explanation of how to reach these conclusions.

  8. True, she does mention randomization. Maybe her mistake was to ask an epidemiologist about clinical trials, not realizing that epidemiologists do surveys, not experiments. Equating the groups being compared is a much bigger deal for epidemiologists than for experimenters.

    I don’t think it’s obvious that the beta-carotene clinical trials are more trustworthy than the other beta-carotene studies. I’d have to know a lot more about the details before I’d reach that conclusion. For example, large clinical trials allow vast possibilities for data entry errors, which will reduce differences between groups. I know an example where a transcription error wasn’t noticed for 40 years. Did the MRFIT clinical trial reach the right conclusion (of no effect)? It’s still hard to know.

    What neither Kolata nor her experts understand is that until something more accurate than “randomized clinical trials” comes along, we have no way of generally assessing their accuracy — just as the problem with eyewitness testimony only became apparent when DNA testing came along.

Comments are closed.