Causal Reasoning in Science: Don’t Dismiss Correlations

In a paper (and blog post), Andrew Gelman writes:

As a statistician, I was trained to think of randomized experimentation as representing the gold standard of knowledge in the social sciences, and, despite having seen occasional arguments to the contrary, I still hold that view, expressed pithily by Box, Hunter, and Hunter (1978) that “To find out what happens when you change something, it is necessary to change it.”

Box, Hunter, and Hunter (1978) (a book called Statistics for Experimenters) is well-regarded by statisticians. Perhaps Box, Hunter, and Hunter, and Andrew, were/are unfamiliar with another quote (modified from Beveridge): “Everyone believes an experiment except the experimenter; no one believes a theory except the theorist.”

Box, Hunter, and Hunter were/are theorists, in the sense that they don’t do experiments (or even collect data) themselves. And their book has a massive blind spot. It contains 500 pages on how to test ideas and not one page, not one sentence, on how to come up with ideas worth testing. Which is just as important. Had they considered both goals (idea generation and idea testing), they would have written a different book. It would have said much more about graphical data analysis and simple experimental designs, and, I hope, would not have contained the flat statement (“To find out what happens …”) Andrew quotes.

“To find out what happens when you change something, it is necessary to change it.” It’s not “necessary” because belief in causality, like all belief, is graded: it can take on an infinity of values, from zero (“can’t possibly be true”) to one (“I’m completely sure”). And belief changes gradually. In my experience, significant (substantially greater than zero) belief in the statement “A changes B” usually starts with the observation of a correlation between A and B. For example, I began to believe that one-legged standing would make me sleep better after I slept unusually well one night and realized that the previous day I had stood on one leg (which I almost never do). That correlation made “one-legged standing improves sleep” more plausible, taking it from near zero to some middle value of belief (“might be true, might not be true”). Experiments in which I stood on one leg various amounts pushed my belief in the statement close to one (“sure it’s true”). In other words, my journey “to find out what happens” to my sleep when I stood on one leg began with a correlation. Not an experiment. To push belief from high (say, 0.8) to really high (say, 0.99) you do need experiments. But to push belief from low (say, 0.0001) to medium (say, 0.5), you don’t need experiments. To fail to understand how beliefs begin, as Box et al. apparently do, is to miss something really important.
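One way to make the idea of graded belief concrete is the odds form of Bayes’ rule, in which posterior odds equal prior odds times a likelihood ratio. The sketch below is only an illustration under that assumption; the post itself does not commit to a Bayesian model, and the function names (to_odds, to_prob, update, required_ratio) are hypothetical. It uses the belief values mentioned above (0.0001 to 0.5, and 0.8 to 0.99) to show how strong the evidence would have to be, as a likelihood ratio, to produce each change.

```python
# A minimal sketch, assuming belief is a probability updated by Bayes' rule
# in odds form: posterior_odds = prior_odds * likelihood_ratio.
# All names here are illustrative, not taken from the post.

def to_odds(p: float) -> float:
    """Convert a probability strictly between 0 and 1 into odds."""
    return p / (1.0 - p)


def to_prob(odds: float) -> float:
    """Convert odds back into a probability."""
    return odds / (1.0 + odds)


def update(prior: float, likelihood_ratio: float) -> float:
    """Return the belief after seeing evidence with the given likelihood ratio."""
    return to_prob(to_odds(prior) * likelihood_ratio)


def required_ratio(prior: float, posterior: float) -> float:
    """Likelihood ratio needed to move belief from prior to posterior."""
    return to_odds(posterior) / to_odds(prior)


if __name__ == "__main__":
    # Low to medium belief (the kind of change attributed to a correlation).
    print(required_ratio(0.0001, 0.5))
    # High to very high belief (the kind of change attributed to experiments).
    print(required_ratio(0.8, 0.99))
```

Under this assumed model, moving from 0.0001 to 0.5 corresponds to a likelihood ratio of about 9999, and moving from 0.8 to 0.99 to a ratio of about 24.75. The model is silent on where the evidence comes from: a surprising observation and a designed experiment both enter only through the likelihood ratio.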

Science is about increasing certainty, about learning. You can learn from any observation, as distasteful as that may be to evidence snobs. By saying that experiments are “necessary” to find out something, Box et al. said the opposite of “you can learn from any observation.” Among shades of gray, they drew a line and said “this side white, that side black”.

The Box et al. attitude makes a big difference in practice. It has two effects:

  1. Too-complex research designs. Just as researchers undervalue correlations, they undervalue simple experiments. They overdesign. Their experiments (or data collection efforts) cost far more and take much longer than they should. The self-experimentation I’ve learned so much from, for example, is undervalued. This is one reason I learned so much from it — because it was new.
  2. Existing evidence is undervalued, even ignored, because it doesn’t meet some standard of purity.

In my experience, both tendencies (too-complex designs, undervaluation of evidence) are very common. In the last ten years, for example, almost every proposed experiment I’ve learned about has been more complicated than I think wise.

Why did Box, Hunter, and Hunter get it so wrong? I think it gets back to the job/hobby distinction. As I said, Box et al. didn’t generate data themselves. They got it from professional researchers — mostly engineers and scientists in academia or industry. Those engineers and scientists have jobs. Their job is to do research. They need regular publications. Hypothesis testing is good for that. You do an experiment to test an idea, you publish the result. Hypothesis generation, on the other hand, is too uncertain. It’s rare. It’s like tossing a coin, hoping for heads, when the chance of heads is tiny. Ten researchers might work for ten years, tossing coins many times, and generate only one new idea. Perhaps all their work, all that coin tossing, was equally good. But only one researcher came up with the idea. Should only one researcher get credit? Should the rest get fired, for wasting ten years? You see the problem, and so do the researchers themselves. So hypothesis generation is essentially ignored by professionals because they have jobs. They don’t go to statisticians asking: How can I better generate ideas? They do ask: How can I better test ideas? So statisticians get a biased view of what matters, do biased research (ignoring idea generation), and write biased books (that don’t mention idea generation).

My self-experimentation taught me that the Box et al. view of experimentation (and of science — that it was all about hypothesis testing) was seriously incomplete. It could do so because it was like a hobby. I had no need for publications or other steady output. Over thirty years, I collected a lot of data, did a lot of fast-and-dirty experiments, noticed informative correlations (“accidental observations”) many times, and came to see the great importance of correlations in learning about causality.

9 Replies to “Causal Reasoning in Science: Don’t Dismiss Correlations”

  1. It’s a great point that the world also needs better theory generation, but you’re reviewing a different book when you criticize their focus on experimental design. Was their book a failure since it wasn’t The Soup to Nuts Encyclopedia of the Scientific Method?
    Box, Hunter and Hunter were many things, but they were not evidence snobs. When I asked George Box what he thought about requiring a p-value<0.05 to declare a factor "important", he said it was much more important to use your brain than to pass some artificial statistical threshold.
    It feels like you're stretching your thesis a bit (academics are more interested in earning status through publishing the arcane than doing "useful" work) when you also claim that engineers in industry are only interested in research. Process optimization and problem-solving are decidedly not ivory tower, and most of the industrial examples from BHH's book were real-world problems. Problem: too many defects on the production line? Answer: run a fancy experimental design to impress your colleagues, but still not fix the problem. Good luck in your next job.
    My reading of the quote about "you need to change something to learn" is that it is not against using induction to generate new theories and it's not saying "correlation is not causation". It's about the simple truth that once you have an idea in your head, the only way to learn about it (get more confidence in your belief) is to test it. How did you learn about one-leg standing in the first place? You had to try it. By my reading the real-world problems they were addressing with their book were: 1) coming up with a theory, assuming it's true, and never testing it (a staple of executives in business) 2) over-reliance on one-factor-at-a-time experiments to generate knowledge in a world where interactions exist.
    You're painting Box and the Hunters with the brush of the ivory-tower statistician who only believes results with a p value<0.05 from randomized studies, when they were quite practical in using experimental design to improve real-world processes. The current efforts of statisticians in experimental design do fit 100% with your ideas about academia though. The designs get more esoteric, harder to use, and less useful to the engineers working on real-world problems.
    Since their book completely missed the theory-generation mark, do you have any recommendations of books that cover that topic well?

    1. I heard Box speak once. One of his examples involved design of a cake mix. Yes, Box et al. is genuinely useful, especially to people who are designing stuff. In the broad sequence of science/engineering, the steps are 0. have no idea. 1. have an idea that might be right. 2. confirm idea. 3. use confirmed idea to do something useful. Box, I gathered, was much more familiar with the end of this sequence (where interactions really matter) than the beginning (where they don’t matter at all). If the book had been titled Statistics For Engineers I wouldn’t be complaining. Their conventional emphasis on “hypothesis testing” is actually misleading in two ways: it disregards idea generation and it is a poor name for idea development, which much of their book is about.

      John Tukey’s Exploratory Data Analysis is a good guide to data analysis that will help you generate ideas. No one, as far as I know, has ever written at length about how to design research (e.g., experiments) to generate ideas.

  2. William Hunter was my father. He did many experiments. George Box did many experiments. You are entitled to your opinions, obviously, but the claim that they only dealt with other people’s data is not accurate.

    It is true they were world-renowned experts on experimenting and had many people consult them about their experiments, for help with designing them, analyzing them, deciding what to do next, improving the process of experimentation in their organization, etc. While it seems to be implied in the post that such consultation was a reason to distrust their thoughts on experimentation, I hardly think that is a sensible conclusion to draw. Most of those they helped were running experiments in industry, to improve results (not to publish papers).

    They were and are applied statisticians. What experiments need to be done is critical for an applied statistician. What matters is making improvement in real world processes. If you don’t run the right experiments, you won’t learn things to help you improve.

    They worked quite a bit on the problem of where to focus in order to learn. One significant part of their belief was to have those involved in the work do the thinking about what needed to be improved. This isn’t tremendously radical today, but in the past you had many people who thought “workers” should do what the college graduates in their office at headquarters told them to do. Here is one of many such examples, from

    “The key is that employees at all levels must have appropriate technical tools so that they can do the following things:

    – recognize when a problem has arisen or an opportunity for improvement exists,
    – collect relevant data,
    – analyze the situation,
    – determine whose responsibility it is to take further action,
    – solve the problem or refer it to someone more appropriate…”

    I don’t have the book in front of me, but doesn’t it start with an example on learning where you can use inductive reasoning, and from the facts that you see you can draw conclusions and construct a theory that fits the facts? If so, that seems to call into question the idea that they claimed “[the] opposite of you can learn from any observation.” They understood you can use inductive reasoning to create theories. You then use experiments to test theories.

    The book is called Statistics for Experimenters, right? Not statistics for drawing conclusions when not doing experiments. When you are experimenting you can test whether beliefs you have are accurate and you can learn about things you try. Smart people can make guesses about what will happen and be right. I know the authors would believe those knowledgeable about the system in question are well suited to determine what variables to test. It is that knowledge that will lead to experiments that are likely to be effective.

    The authors of the book were trying to help those that often failed to learn as much from experiments as they could. Far too many people still don’t use the most effective statistical tools when experimenting.

    They emphasized, consistently, the need for those doing the work to be involved in the experiments. The job of statisticians was to help in the cases where advanced statistical tools and knowledge would be useful. The reason to involve those who do the work (who are familiar with the process) is that they have knowledge to bring to what should be tried in experiments.

    When I read through The Scientific Context of Quality Improvement (1987) by George Box and Soren Bisgaard, it seems to me it discusses the types of issues you raise: how do we learn without experimenting? I am not sure if it is just me, or if it clearly addresses that issue. Here is another: Statistics as a Catalyst to Learning by Scientific Method by George E. P. Box. Here is another: Statistics for Discovery. There are many other sources, I am sure. They understood the importance of learning as much as you could from available sources. They just also understood the importance of experiments and learning the most you could from experiments. And the book, Statistics for Experimenters, was focused on the most effective ways to improve using statistics to learn from experiments.

    Here is what Box said in his own words about the objective (and it isn’t proving the hypothesis):

    “[too many people] can’t really get the fact that it’s not about proving a theorem, it’s about being curious about things. There aren’t enough people who will apply [DOE] as a way of finding things out.”

    Statistics for Experimenters: Design, Innovation, and Discovery shows that the goal of design of experiments is to learn and refine your experiment based on the knowledge you gain and experiment again. It is a process of discovery. That discovery is useful when it allows you to make improvement in real world outcomes. That is the objective.

    1. John Hunter, thank you for your comments and links. I’m sorry that I don’t have time to revisit the whole subject at length right now and I imagine that anyone reading this can judge for themselves whether I am being fair. A briefer comment on George Box’s contribution is this: There are three stages of scientific/engineering inquiry: 1. discovery (finding ideas to take seriously). Above all, this involves discovery of new cause-effect relationships. 2. testing (distinguishing among the ideas created in the first stage). 3. development (using ideas from stage 2 to create useful products). Box was an expert at Stage 3; he seems to have known little about Stage 1. All the detailed examples in his work, including the papers you link to, come from Stage 3. In Stage 3, interactions are what matter most — the main effects were figured out in Stage 2. In Stage 1, interactions are unimportant because the main effects haven’t yet been figured out.

      The best tools for Stage 3 are quite different from the best tools for Stage 1, which is why Statistics for Experimenters is so incomplete for anyone who wants to do Stage 1. In contrast, Exploratory Data Analysis was heavily focussed on Stage 1. A statistician once said to me that no one actually used Tukey’s ideas. It was true that for Stage 2 and especially Stage 3, Tukey’s ideas were indeed less useful. Stage 2 and especially Stage 3 are more remunerative than Stage 1 so there are far more people working there, as far as I can tell.

  3. Thanks for your reply. I realize you can’t respond in detail to every person who comments; I just wanted to make some points I think are important. I’m not so sure Box’s expertise is limited to Stage 3, but that is certainly something people can decide.

    I think the idea that they were disconnected from the real world improvement results is the thing I feel strongly against. They focused precisely on how to improve. They did not care for elegant statistical models that didn’t actually result in better results in the real world.

    And they fought the heavy emphasis on mathematical statistics, as opposed to applied statistics, that existed and still exists. They were (in my biased opinion) in the lead in promoting the view (still a minority one) that what matters is statisticians actually working with people on real world improvements. As part of that they completely understood that the systems involved include generating ideas as a very important component. It is also true they didn’t write about that aspect as much. They did write about it quite a bit (it seems to me), but it might not be as in-your-face as the writing on using statistics and experiments to continually improve results. And it is true, I think, that they had less new knowledge to share with the world in that area.

  4. John Hunter, you are right: to lump Box in with the many, many statisticians who had less real-world experience than he did is misleading. By calling him a “theorist” I probably did that. He was a theorist in the sense that the title of his book seemed to imply that it applied to all experimentation, whereas it did not. For example, it said nothing useful about how to do experiments that generate ideas in the Stage 1 of science I describe above. But thank you for writing again to clarify this. The reason I read Box, Hunter and Hunter in the first place was that it was more applications-oriented and realistic than a dozen other books on the subject.

Comments are closed.