The Wisdom of Google: “Dessert”, “Honey” and “Fruit” Closer to “Dinner” than “Breakfast” or “Lunch”

I have blogged many times that bedtime honey improves sleep. I learned this from Stuart King, an Australian musician. He also pointed out that we eat dessert with dinner more than with other meals, which others who have described the honey effect have not said. The dessert observation suggests that other sweets, not just honey, improve sleep. After I repeated the dessert observation, a friend said I of all people should know it isn’t universal. The Chinese don’t eat dessert, she said. Yes, I said, but where I lived in Beijing there seemed to be lots of sweets eaten in the evening, and lots of street vendors selling fruit in the evening.

Researchers Fool Themselves: Water and Cognition

A recent paper about the effect of water on cognition illustrates a common way that researchers overstate the strength of the evidence, apparently fooling themselves. Psychology researchers at the University of East London and the University of Westminster did an experiment in which subjects didn’t drink or eat anything starting at 9 pm and the next morning came to the testing room. All of them were given something to eat, but only half of them were given something to drink. They came in twice. On one week, subjects were given water to drink; on the other week, they weren’t given water. Half of the subjects were given water on the first week, half on the second. Then they gave subjects a battery of cognitive tests.

One result makes sense: subjects were faster on a simple reaction time test (press button when you see a light) after being given water, but only if they were thirsty. Apparently thirst slows people down. Maybe it’s distracting.

The other result emphasized by the authors doesn’t make sense: Water made subjects worse at a task called Intra-Extra Dimensional Set Shift. The task provided two measures (total trials and total errors) but the paper gives results only for total trials. The omission is not explained. (I asked the first author about this by email; she did not explain the omission.) On total trials, subjects given water did worse, p = 0.03. A surprising result: after persons go without water for quite a while, giving them water makes them worse.

This p value is not corrected for the number of tests done. A table of results shows that 14 different measures were used. There was a main effect of water on two of them. One was the simple reaction time result; the other was the IED Stages Completed (IED = intra/extra dimensional) result. The effect of water on simple reaction time was likely a true positive because it was modulated by thirst. In contrast, the IED Stages Completed effect wasn’t reliably influenced by thirst. Putting the simple reaction time result aside, there are 13 p values for the main effect of water; one is weakly reliable (p = 0.03). If you do 20 independent tests, it is likely that at least one will reach p < 0.05 purely by chance, even when there are no true effects. Taken together, there is no good reason to believe that water had main effects aside from the simple reaction time test. The paper would make a good question for an elementary statistics class (“Question: If 13 tests are independent, and there are no true effects present, how likely is it that at least one will reach p = 0.03 or better by chance? Answer: 1 − 0.97^13 = 0.33”).
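The arithmetic in that classroom question is easy to check. A short Python sketch (the function is mine, written only for illustration):

```python
# Probability of at least one false positive among k independent
# tests when every null hypothesis is true.
def false_positive_prob(k, alpha):
    return 1 - (1 - alpha) ** k

# 13 tests, each judged at the observed p = 0.03
print(round(false_positive_prob(13, 0.03), 2))  # 0.33

# 20 tests at the conventional alpha = 0.05
print(round(false_positive_prob(20, 0.05), 2))  # 0.64
```

So even with no true effects at all, a result like the paper’s is expected about a third of the time.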

I wrote to the first author (Caroline Edmonds) about this several days ago. My email asked two questions. She replied but failed to answer the question about number of tests. Her answer was written in haste; maybe she will address this question later.

A better analysis would have started by assuming that the 14 measures are unlikely to be independent. It would have done (or used) a factor analysis that condensed the 14 measures into (say) three factors. Then the researchers could ask if water affected each of the three factors. Far fewer tests, far more independent tests, far harder to fool yourself or cherry-pick.

The problem here — many tests, failure to correct for this or do an analysis with far fewer tests — is common but the analysis I suggest is, in experimental psychology papers, very rare. (I’ve never seen it.) Factor analysis is taught as part of survey psychology (psychology research that uses surveys, such as personality research), not as part of experimental psychology.  In the statistics textbooks I’ve seen, the problem of too many tests and correction for/reduction of number of tests isn’t emphasized. Perhaps it is a research methodology example of Gresham’s Law: methods that make it easier to find what you want (differences with p < 0.05) drive out better methods.

Thanks to Allan Jackson.

Assorted Links

Thanks to Bryan Castañeda.

The Growth of Personal Science: Implications For Statistics

I have just submitted a paper to Statistical Science called “The Growth of Personal Science: Implications For Statistics”. The core of the paper is examples, mostly my work (on flaxseed oil, butter, standing, and so on). There is also a section on the broad lessons of the examples — what can be learned from them in addition to the subject-matter conclusions (e.g., butter makes me faster at arithmetic). The paper grew out of a talk I gave at the Joint Statistical Meetings a few years ago, as part of a session organized by Hadley Wickham, a professor of statistics at Rice University.

Assorted Links

Usual Drug Trial Analyses Insensitive to Rare Improvement

In a comment on an article in The Scientist, someone tells a story with profound implications:

I participated in 1992 NCI SWOG 9005 Phase 3 [clinical trial of] Mifepristone for recurrent meningioma. The drug put my tumor in remission when it regrew post surgery. However, other more despairing patients had already been grossly weakened by multiple brain surgeries and prior standard brain radiation therapy which had failed them before they joined the trial.  They were really not as young, healthy and strong as I was when I decided to volunteer for a “state of the art” drug therapy upon my first recurrence.  . . .  I could not get the names of the anonymous members of the Data and Safety Monitoring committee who closed the trial as “no more effective than placebo”. I had flunked the placebo the first year and my tumor did not grow for the next three years I was allowed to take the real drug. I finally managed to get FDA approval to take the drug again in Feb 2005 and my condition has remained stable ever since according to my MRIS.

Apparently the drug did not work for most participants in the trial — leading to the conclusion “no more effective than placebo” — but it did work for him.

The statistical tests used to decide if a drug works are not sensitive to this sort of thing — most patients not helped, a few patients helped. (Existing tests, such as the t test, work best with normality of both groups, treatment and placebo, whereas this outcome produces non-normality of the treatment group, which reduces test sensitivity.) It is quite possible to construct analyses that would be more sensitive to this than existing tests, but this has not been done. It is quite possible to run a study that produces for each patient a p value for the null hypothesis of no effect (a number that helps you decide if that particular patient has been helped) but this too has not been done.
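A small simulation shows the insensitivity. In the hypothetical trial below, a drug either helps every patient a little or helps 5% of patients a lot, with the same average benefit; a t test detects the uniform effect far more often. (Sample sizes and effect sizes are mine, chosen only for illustration.)

```python
import random, math

def t_stat(a, b):
    """Welch's t statistic for two independent samples."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def power(effect, sims=2000, n=100):
    """Fraction of simulated trials in which |t| > 1.96."""
    hits = 0
    for _ in range(sims):
        placebo = [random.gauss(0, 1) for _ in range(n)]
        treated = [random.gauss(0, 1) + effect() for _ in range(n)]
        if abs(t_stat(treated, placebo)) > 1.96:
            hits += 1
    return hits / sims

random.seed(1)
# Both scenarios have the same mean improvement (0.3 standard deviations).
uniform = power(lambda: 0.3)                                  # everyone helped a little
rare = power(lambda: 6.0 if random.random() < 0.05 else 0.0)  # 5% helped a lot
print(uniform, rare)  # the t test detects the uniform effect far more often
```

The rare-responder scenario also inflates the treatment group’s variance, which further reduces the t test’s sensitivity.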

Since these new analyses would benefit drug companies, their absence is curious.

Gene Linked to Autism?

An article in the New York Times describes research that supposedly linked a rare gene mutation to autism:

Dr. Matthew W. State, a professor of genetics and child psychiatry at Yale, led a team that looked for de novo mutations [= mutations that are not in the parents] in 200 people who had been given an autism diagnosis, as well as in parents and siblings who showed no signs of the disorder. The team found that two unrelated children with autism in the study had de novo mutations in the same gene — and nothing similar in those without a diagnosis.

“That is like throwing a dart at a dart board with 21,000 spots and hitting the same one twice,” Dr. State said. “The chances that this gene is related to autism risk is something like 99.9999 percent.”

It is like throwing 200 darts at a dart board with 21,000 spots (the number of genes) and hitting the same one twice. (Each person has about 1 de novo mutation.) What are the odds of that? If all spots are equally likely to be hit, then the probability is about 0.6. More likely than not. (Dr. State seems to think it is extremely unlikely.) This is a variation on the birthday paradox. If there are 23 people in a room, it is 50/50 that two of them will share a birthday.
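The collision probability is easy to compute exactly. A quick check in Python, using the numbers from the article (200 people, about 21,000 genes, roughly one de novo mutation each):

```python
def collision_prob(darts, spots):
    """Probability that at least two of `darts` random throws land on
    the same one of `spots` equally likely spots."""
    p_all_distinct = 1.0
    for i in range(darts):
        p_all_distinct *= (spots - i) / spots
    return 1 - p_all_distinct

print(round(collision_prob(200, 21000), 2))  # 0.61: more likely than not
print(round(collision_prob(23, 365), 2))     # 0.51: the classic birthday paradox
```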

When Dr. State says, “The chances that this gene is related to autism risk is something like 99.9999 percent,” he is making an elementary mistake. He has taken a very low p value (maybe 0.000001) from a statistical test and treated it as the probability that the null hypothesis (no association with autism) is false. P values indicate strength of evidence, not probability of truth.

One way to look at the evidence is that there is a group of 200 people (with an autism diagnosis) among whom two have a certain mutation and another group of about 600 people (their parents and siblings) none of whom have that mutation. If two instances of the mutation were randomly distributed among 800 people what are the odds that both instances would be in any pre-defined group of 200 of the 800 people (defined, say, by the letters in their first name)? The chance of this happening is 1/16. Not strong evidence of an association between the mutation and the actual pre-defined group (autism diagnosis).
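That 1/16 comes from simple counting: the first carrier falls in the 200-person group with probability 200/800, the second with probability 199/799. In Python:

```python
# Two mutation carriers placed at random among 800 people; probability
# that both land in one pre-defined group of 200.
p = (200 / 800) * (199 / 799)
print(round(p, 3))  # 0.062, close to 1/16 = 0.0625
```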

Another study published at the same time found a link between autism and a mutation in the same gene identified by Dr. State’s group, but again the association was weak. It may be a more subtle example of the birthday paradox: if twenty groups of genetics researchers are looking for a gene linked to autism, what are the odds that two of them will happen upon the same gene by chance?

If the gene with the de novo mutations is actually linked to autism, then we will have insight into the cause of 1% of the 200 autism cases Dr. State’s group studied. When genetics researchers try so hard and come up with so little, it increases my belief that the main causes of autism are environmental.

Thanks to Bryan Castañeda.

“Seth, How Do You Track and Analyze Your Data?”

A reader asks:

I haven’t found much on your blog commenting on tools you use to track your data. Any recommendations? Have you tried smart phones? For example, I have tried tracking fifteen variables daily via the iPhone app Moodtracker, the only one I found that can track and graph multiple variables and also give you automated reminders to submit data. There are other variants (Data Logger, Daytum) that will graph one variable (say, miles run per day), but Moodtracker is the only app I’ve found that lets you analyze multiple variables.

I use R on a laptop to track and analyze my data. I write my own functions for doing this — they are not built-in. This particular reader hadn’t heard of R. It is free and the most popular software among statisticians. It has lots of built-in functions (although not for data collection — apparently statisticians rarely collect data) and provides lots of control over the graphs you make, which is very important. R also has several functions for fitting loess curves to your data. Loess is a form of local regression, a kind of curve-fitting. There is a vast amount of R-related material online, including introductory material.

To give an example, after I weigh myself each morning (I have three scales), I enter the three weights into R, which stores them and makes a graph. That’s on the simple side. At the other extreme are the various mental tests I’ve written (e.g., arithmetic) to measure how well my brain is working. The programs for running the tests are written in R, and the data are stored and analyzed in R.
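My functions are personal R code, not a package, but the workflow is simple enough to sketch. Here is a Python analogue of the morning-weight example (the file name and the numbers are only illustrative): append today’s readings to a file, then summarize them.

```python
import csv, statistics
from datetime import date

def record_weights(path, weights):
    """Append today's readings (one per scale) and return their mean."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([date.today().isoformat(), *weights])
    return statistics.mean(weights)

print(record_weights("weights.csv", [70.2, 70.4, 70.1]))  # about 70.23
```

The accumulated file can then be read back for plotting or curve-fitting, which is the part where control over the graphs matters.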

The analysis possibilities (e.g., the graphs you can make, your control over those graphs) I’ve seen on smart phone apps are hopelessly primitive for what I want to do. The people who write the analysis software seem to know almost nothing about data analysis. For example, I use a website called RankTracer to track the Amazon ranking of The Shangri-La Diet. Whoever wrote the software is so clueless the rank versus time graphs don’t even show log ranks.

I don’t know what the future holds. In academic psychology, there is near-total reliance on statistical packages (e.g., SPSS) that are so limited they can perhaps extract only half of the information in the usual data. There are many graphs you’d like to make that they cannot make. SPSS may not even have loess, for example. Yet I see no sign of this changing. Will personal scientists want to learn more from their data than psychology professors (and therefore be motivated to go beyond pre-packaged analyses)? I don’t know.

Causal Reasoning in Science: Don’t Dismiss Correlations

In a paper (and blog post), Andrew Gelman writes:

As a statistician, I was trained to think of randomized experimentation as representing the gold standard of knowledge in the social sciences, and, despite having seen occasional arguments to the contrary, I still hold that view, expressed pithily by Box, Hunter, and Hunter (1978) that “To find out what happens when you change something, it is necessary to change it.”

Box, Hunter, and Hunter (1978) (a book called Statistics for Experimenters) is well-regarded by statisticians. Perhaps Box, Hunter, and Hunter, and Andrew, were/are unfamiliar with another quote (modified from Beveridge): “Everyone believes an experiment except the experimenter; no one believes a theory except the theorist.”

The Problem with Evidence-Based Medicine

In a recent post I said that med school professors cared about process (doing things a “correct” way) rather than result (doing things in a way that produces the best possible outcomes). Feynman called this sort of thing “cargo-cult science“. The problem is that there is little reason to think the med-school profs’ “correct” way (evidence-based medicine) works better than the “wrong” way it replaced (reliance on clinical experience) and considerable reason to think it isn’t obvious which way is better.

After I wrote the previous post, I came across an example of the thinking I criticized. During a conversation between Peter Lipson (a practicing doctor) and Isis The Scientist (a “physiologist at a major research university” who blogs at ScienceBlogs), Isis said this:

I had an experience a couple days ago with a clinician that was very valuable. He said to me, “In my experience this is the phenomenon that we see after this happens.” And I said, “Really? I never thought of that as a possibility but that totally fits in the scheme of my model.” On the one hand I’ve accepted his experience as evidence. On the other hand I’ve totally written it off as bullshit because there isn’t a p value attached to it.

Isis doesn’t understand that this “p value” she wants so much comes with a sensitivity filter attached. It is not neutral. To get it you do extensive calculations. The end result (the p value) is more sensitive to some treatment effects than others in the sense that some treatment effects will generate smaller (better) p values than other treatment effects of the same strength, just as our ears are more sensitive to some frequencies than others.

Our ears are most sensitive around the frequency of voices. They do a good job of detecting what we want to detect. What neither Isis nor any other evidence-based-medicine proponent knows is whether the particular filter they endorse is sensitive to the treatment effects that actually exist. It’s entirely possible and even plausible that the filter that they believe in is insensitive to actual treatment effects. They may be listening at the wrong frequency, in other words. The useful information may be at a different frequency.

The usual statistics (mean, etc.) are most sensitive to treatment effects that change each person in the population by the same amount. They are much less sensitive to treatment effects that change only a small fraction of the population. In contrast, the “clinical judgment” that Isis and other evidence-based-medicine advocates deride is highly sensitive to treatments that change only a small fraction of the population — what some call anecdotal evidence. Evidence-based medicine is presented as science replacing nonsense but in fact it is one filter replacing another.

I suspect that actual treatment effects have a power-law distribution (a few helped a lot, a large fraction helped little or not at all) and that a filter resembling “clinical judgment” does a better job with such distributions. But that remains to be seen. My point here is just that it is an empirical question which filter works best. An empirical question that hasn’t been answered.

Does Lithium Slow ALS?

In 2008, an article in Proceedings of the National Academy of Sciences (PNAS) reported that lithium had slowed the progression of amyotrophic lateral sclerosis (ALS), which is always fatal. This article describes several attempts to confirm that effect of lithium. Three studies were launched by med school professors. In addition, patients at PatientsLikeMe also organized a test.

One of Nassim Taleb’s complaints about finance professors is their use of VAR (value at risk) to measure the riskiness of investments. It’s still being taught at business schools, he says. VAR assumes that fluctuations have a certain distribution. The distributions actually assumed turned out to grossly underestimate risk. VAR has helped many finance professionals take risks they shouldn’t have taken. It would have been wise for finance professors to wonder how well VAR does in practice, and thereby to judge the plausibility of the assumed distribution. This might seem obvious. Likewise, the response to the PNAS paper revealed two problems that might seem obvious:

1. Unthinking focus on placebo controls. It would have been progress to find anything that slows ALS. Anything includes placebos. Placebos vary. From the standpoint of those with ALS, it would have been better to compare lithium to nothing than to some sort of placebo. As far as I can tell from the article, no med school professor realized this. No doubt someone has said that the world can be divided into people focused on process (on doing things a certain “right” way) and those focused on results (on outcomes). It should horrify all of us that med school professors appear focused on process.

2. Use of standard statistics (e.g., mean) to measure drug effects. I have not seen the ALS studies, but if they are like all other clinical trials I’ve seen, they tested for an effect by comparing means using a parametric test (e.g., a t test). However, effects of treatment are unlikely to have normal distributions nor are likely to be the same for each person. The usual tests are most sensitive when each member of the treatment group improves the same amount and the underlying variation is normally distributed. If 95% of the treatment group is unaffected and 5% show improvement, for example, the usual tests wouldn’t do the best job of noticing this. If medicine A helps 5% of patients, that’s an important improvement over 0%, especially with a fatal disease. And if you take it and it doesn’t help, you stop taking it and look elsewhere. So it would be a good idea to find drugs that only help a fraction of patients, perhaps a small fraction. The usual analyses may have caused drugs that help a small fraction of patients to be considered worthless when they could have been detected.

All the tests of lithium, including the PatientsLikeMe test, turned out negative. The PatientsLikeMe trial didn’t worry about placebo effects, so my point #1 isn’t a problem. However, my point #2 probably applies to all four trials.

Thanks to JR Minkel and Melissa Francis.

Unlikely Data

Connoisseurs of scientific fraud may enjoy David Grann’s terrific article about an art authenticator in the current New Yorker and this post about polling irregularities. What are the odds that two such articles would appear at almost the same time?

I suppose I’m an expert, having published several papers about data that was too unlikely. With Saul Sternberg and Kenneth Carpenter, I’ve written about problems with Ranjit Chandra’s work. I also wrote about problems with some learning experiments.

Beijing Street Vendors: What Color Market?

Black market = illegal. Grey market = “the trade of a commodity through distribution channels . . . unofficial, unauthorized, or unintended.”

In the evening, near the Wudaokou subway station in Beijing (where lots of students live), dozens of street vendors sell paperbacks ($1 each), jewelry, dresses, socks, scarves, electronic accessories, fruit, toys, shoes, cooked food, stuffed animals, and many other things. No doubt it’s illegal. When a police car approaches, they pick up and leave. Once I saw a group of policemen confiscate a woman’s goods.

What’s curious is how far vendors move when police approach. Once I saw the vendors on a corner, all 12 of them, each with a cart, move to the middle of the intersection — the middle of traffic — where they clustered. At the time I thought the traffic somehow protected them. Now I think they wanted to move back fast when the police car went away. Tonight, like last night, there’s a police car at that corner, the northeast corner of the intersection. No vendors there. The vendors who’d usually be there were now at the northwest corner. In other words, if a policeman got out of his car and walked across the street, he’d encounter all the vendors that he’d displaced.

Can John Gottman Predict Divorce With Great Accuracy?

Andrew Gelman blogged about the research of John Gottman, an emeritus professor at the University of Washington, who claimed to be able to predict whether newlyweds would divorce within 5 years with greater than 90% accuracy. These predictions were based on brief interviews near the time of marriage. Andrew agreed with another critic who said these claims were overstated. He modified Gottman’s Wikipedia page to reflect those criticisms. Andrew’s modifications were removed by someone who works for the Gottman Institute.

Were the criticisms right or wrong? The person who removed reference to them in Wikipedia referred to a FAQ page on the Gottman Institute site. Supposedly they’d been answered there. The criticism is that the “predictions” weren’t predictions: they were descriptions of how well a model, fitted after the data were collected, could fit those same data. If the model were complicated enough (had enough adjustable parameters), it could fit the data perfectly, but that would be no support for the model — and not “100% accurate prediction” as most people understand it.

The FAQ page says this:

Six of the seven studies have been predictive—each began with a hypothesis about factors leading to divorce. [I think the meaning is this: The first study figured out how to predict. The later six tested that method.] Based on these factors, Dr. Gottman predicted who would divorce, then followed the couples for a pre-determined length of time. Finally, he drew conclusions about the accuracy of his predictions. . . . This is true prediction.

This is changing the subject. The question is not whether Gottman’s research is any help at all, which is the question answered here; the question is whether he can predict at extremely high levels (> 90% accuracy), as claimed. Do the later six studies provide reasonable estimates of prediction accuracy? Presumably the latest ones are better than the earlier ones. The latest one (2002) was obviously not about accurate prediction estimates (its title used the term “exploratory”) so I looked at the next newest, published in 2000. Here’s what its abstract says:

A longitudinal study with 95 newlywed couples examined the power of the Oral History Interview to predict stable marital relationships and divorce. A principal components analysis of the interview with the couples (Time 1) identified a latent variable, perceived marital bond, that was significant in predicting which couples would remain married or divorce within the first 5 years of their marriage. A discriminant function analysis of the newlywed oral history data predicted, with 87.4% accuracy, those couples whose marriages remained intact or broke up at the Time 2 data collection point.

The critics were right. To say a discriminant function “predicted” something is to mislead those who don’t know what a discriminant function is. They don’t predict, they fit a model to data, after the fact. To call this “true prediction” is false.
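The general point: a model with enough adjustable parameters will “predict” its own training data perfectly, even pure noise. A toy Python illustration (my example, not Gottman’s) using polynomial interpolation, where the model has as many parameters as data points:

```python
import random

def lagrange_fit(xs, ys):
    """Return the unique degree n-1 polynomial through the n points."""
    def poly(x):
        total = 0.0
        for i, xi in enumerate(xs):
            term = ys[i]
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return poly

random.seed(0)
xs = list(range(8))
ys = [random.gauss(0, 1) for _ in xs]   # pure noise: nothing to predict
model = lagrange_fit(xs, ys)

# The model reproduces every training point exactly...
in_sample_error = max(abs(model(x) - y) for x, y in zip(xs, ys))
print(in_sample_error < 1e-6)  # True: a "perfect" after-the-fact fit
# ...yet it was fit to noise, so it says nothing about new data.
```

An after-the-fact fit like this is exactly what a discriminant analysis reports; only performance on couples not used to build the model would be true prediction.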

To me, the “87.4%” suggests something seriously off. It is too precise; I would have written “about 90%”. It is as if you asked someone their age and they said they were “24.37 years old.”

Speaking of overstating your results, reporting bias in medical research. Thanks to Anne Weiss.

Andrew Gelman’s Top Statistical Tip

Andrew Gelman writes:

If I had to come up with one statistical tip that would be most useful to you–that is, good advice that’s easy to apply and which you might not already know–it would be to use transformations. Log, square-root, etc.–yes, all that, but more! I’m talking about transforming a continuous variable into several discrete variables (to model nonlinear patterns such as voting by age) and combining several discrete variables to make something [more] continuous (those “total scores” that we all love). And not doing dumb transformations such as the use of a threshold to break up a perfectly useful continuous variable into something binary. I don’t care if the threshold is “clinically relevant” or whatever–just don’t do it. If you gotta discretize, for Christ’s sake break the variable into 3 categories.

I agree (and wrote an article about it). Transforming data is so important that intro stats texts should have a whole chapter on it — but instead they barely mention it. A good discussion of transformation would also include the use of principal components to boil down many variables into a much smaller number. (You should do this twice — once with your independent variables, once with your dependent variables.) Many researchers measure many things (e.g., a questionnaire with 50 questions, a blood test that measures 10 components) and then foolishly correlate all independent variables with all dependent variables. They end up testing dozens of likely-to-be-zero correlations for significance, thereby effectively throwing all their data away — when you do dozens of such tests, none can be trusted.
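A simulation (mine, with made-up dimensions) shows what correlating everything with everything produces: 10 independent and 10 dependent measures of pure noise still yield a handful of “significant” correlations at p < 0.05 in a typical study.

```python
import random, math

def corr(x, y):
    """Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

random.seed(2)
n = 100                      # hypothetical subjects per study
crit = 1.96 / math.sqrt(n)   # approximate |r| cutoff for p < 0.05
reps = 50
total = 0
for _ in range(reps):
    ivs = [[random.gauss(0, 1) for _ in range(n)] for _ in range(10)]
    dvs = [[random.gauss(0, 1) for _ in range(n)] for _ in range(10)]
    total += sum(abs(corr(x, y)) > crit for x in ivs for y in dvs)
print(total / reps)  # about 5 spuriously "significant" correlations per study
```

Condensing each side to a few principal components first would cut 100 tests down to a handful, leaving far less room to fool yourself.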

My explanation why this isn’t taught differs from Andrew’s. I think it’s pure Veblen: professors dislike appearing useful and like showing off. Statistics professors, like engineering professors, do less useful research than you might expect, so they are less aware than you might expect of how useful transformations are. And because most transformations don’t involve esoteric math, writing about them doesn’t allow you to show off.

In my experience, not transforming your data is at least as bad as throwing half of it away, in the sense that your tests will be that much less sensitive.

Exploratory Versus Confirmatory Data Analysis?

In 1977, John Tukey published a book called Exploratory Data Analysis. It introduced many new ways of analyzing data, all relatively simple. Most of the new ways involved plotting your data. A few involved transforming your data. Tukey’s broad point was that statisticians (taught by statistics professors) were missing a lot: conventional statistics focused too much on confirmatory data analysis (testing hypotheses) to the omission of exploratory data analysis — data analysis that might show you something new. Here are some tools to help you explore your data, Tukey was saying.

No question the new tools are useful. I have found great benefits from plotting and transforming my data. No question that conventional statistics textbooks place far too little emphasis on graphs and transformations. But I no longer agree with Tukey’s exploratory versus confirmatory distinction. The distinction that matters — at least to historians, if not to data analysts — is between low-status and high-status. A more accurate title of Tukey’s book would have been Low-Status Data Analysis. Exploratory data analysis already had a derogatory name: Descriptive data analysis. As in mere description. Graphs and transformations are low-status. They are low-status because graphs are common and transformations are easy. Anyone can make a graph or transform their data. I believe they were neglected for that reason. To show their high status, statistics professors focused their research and teaching on more difficult and esoteric stuff — like complicated regression. That the new stuff wasn’t terribly useful (compared to graphs and transformations) mattered little. Like all academics — like everyone — they cared enormously about showing high status. It was far more important to be impressive than to be useful. As Veblen showed, it might have helped that the new stuff wasn’t very useful. “Applied” science is lower status than “pure” science.

That most of what statistics professors have developed (and taught) is less useful than graphs and transformations strikes me as utterly clear. My explanation is that in statistics, just as in every other academic area I know about, desire to display status led to a lot of useless highly-visible work. (What Veblen called conspicuous waste.) Less visibly, it led to the best tools being neglected. Tukey saw the neglect — underdevelopment and underteaching of graphs, for example — but perhaps misdiagnosed the cause. Here’s why Tukey’s exploratory versus confirmatory distinction was misleading: the tools that Tukey promoted for exploration also improve confirmation. They are neglected everywhere. For example:

1. Graphs improve confirmatory data analysis. If you do a t test (or compute a p value in any way) but don’t make an associated graph, there is room for improvement. A graph will show whether the assumptions of the computation are reasonable. Often they aren’t.

2. Transformations improve confirmatory data analysis. That a good transformation will make the assumptions of the test more reasonable many people know. What few people seem to know is that a good transformation will make the statistical test more sensitive. If a difference exists, the test will be more likely to detect it. This is like increasing your sample size at no extra cost.
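Point 2 is easy to demonstrate. In this simulation (my example, not Tukey’s), a treatment doubles a lognormal outcome; the same t test detects the effect more often after a log transform than on the raw, skewed scale:

```python
import random, math

def t_stat(a, b):
    """Welch's t statistic for two independent samples."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def detection_rate(transform, sims=1000, n=40):
    """Fraction of simulated experiments in which |t| > 1.96."""
    hits = 0
    for _ in range(sims):
        control = [math.exp(random.gauss(0, 1)) for _ in range(n)]
        treated = [2.0 * math.exp(random.gauss(0, 1)) for _ in range(n)]
        if abs(t_stat([transform(x) for x in treated],
                      [transform(x) for x in control])) > 1.96:
            hits += 1
    return hits / sims

random.seed(3)
raw = detection_rate(lambda x: x)   # t test on the raw, skewed scale
logged = detection_rate(math.log)   # same test after a log transform
print(raw, logged)  # the log scale detects the doubling more often
```

Same data, same test, same sample size; the transformation alone buys the extra sensitivity.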

3. Exploratory data analysis is sometimes thought of as going beyond the question you started with to find other structure in the data — to explore your data. (Tukey saw it this way.) But to answer the question you started with as well as possible you should find all the structure in the data. Suppose my question is whether X has an effect.  I should care whether Y and Z have an effect in order to (a) make my test of X more sensitive (by removing the effects of Y and Z) and (b) assess the generality of the effect of X (does it interact with Y or Z?).

Most statistics professors and their textbooks have neglected all uses of graphs and transformations, not just their exploratory uses. I used to think exploratory data analysis (and exploratory science more generally) needed different tools than confirmatory data analysis and confirmatory science. Now I don’t. A big simplification.

Exploration (generating new ideas) and confirmation (testing old ideas) are outputs of data analysis, not inputs. To explore your data and to test ideas you already have you should do exactly the same analysis. What’s good for one is good for the other.

Likewise, Freakonomics could have been titled Low-status Economics. That’s essentially what it was, the common theme. Levitt studied all sorts of things other economists thought were beneath them to study. That was Levitt’s real innovation — showing that these questions were neglected. Unsurprisingly, the general public, uninterested in the status of economists, found the work more interesting than high-status economics. I’m sensitive to this because my self-experimentation was extremely low-status. It was useful (low-status), cheap (low-status), small (low-status), and anyone could do it (extremely low-status).

More Andrew Gelman comments. Robin Hanson comments.

How Much Should We Trust Clinical Trials?

Suppose you ask several experts how to choose a good car. Their answers reveal they don’t know how to drive. What should you conclude? Suppose these experts build cars. Should we trust the cars they’ve built?

Gina Kolata writes that “experts agree that there are three basic principles that underlie the search for medical truth and the use of clinical trials to obtain it.” Kolata’s “three basic principles” reveal that her experts don’t understand experimentation.

Principle 1. “It is important to compare like with like. The groups you are comparing must be the same except for one factor — the one you are studying. For example, you should compare beta carotene users with people who are exactly like the beta carotene users except that they don’t take the supplement.” An expert told her this. This — careful equation of two groups — is not how experiments are done. What is done is random assignment, which roughly (but not perfectly) equates the groups on pre-experimental characteristics. A more subtle point is that the X versus No X design is worse than a design that compares different dosages of X. The latter design makes it less likely that control subjects will get upset because they didn’t get X and makes the two groups more equal.
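A small simulation (hypothetical numbers, sketched in Python) shows the “roughly but not perfectly” point: random assignment balances a pre-existing trait only approximately, not exactly.

```python
import random
import statistics

random.seed(1)

# 100 hypothetical subjects with a pre-existing trait (say, age).
ages = [random.gauss(50, 10) for _ in range(100)]

# Random assignment: shuffle the subjects, then split them in half.
shuffled = ages[:]
random.shuffle(shuffled)
treatment, control = shuffled[:50], shuffled[50:]

diff = statistics.mean(treatment) - statistics.mean(control)
print(f"mean age difference between groups: {diff:.2f} years")
# Close to zero, but not exactly zero in any single experiment --
# randomization equates the groups only approximately.
```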

Principle 2. “The bigger the group studied, the more reliable the conclusions.” Again, this is not what happens. No one with statistical understanding judges the reliability of an effect by the size of the experiment; reliability is judged by the p value (which takes sample size into account). The more subtle point is that the smaller the sample size, the stronger the effect must be to get reliable results. Researchers try to conserve resources, so they keep experiments as small as possible. Small experiments with reliable results are more impressive than large experiments with equally reliable results — because the effect must be stronger. This is basically the opposite of what Kolata says.
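To illustrate (with made-up numbers, sketched in Python with scipy): hold the p value of a two-sample t test fixed and ask how strong the observed effect must be at different sample sizes. “Equally reliable” results from a small experiment require a much stronger effect.

```python
import math
from scipy import stats

# Fix the p value (the "reliability") at 0.03 for a two-sample t test
# and ask how strong the observed effect must be at each sample size.
p = 0.03
effects = {}
for n in (10, 100, 1000):                   # subjects per group
    df = 2 * n - 2
    t_crit = stats.t.ppf(1 - p / 2, df)     # t statistic yielding p = 0.03
    effects[n] = t_crit * math.sqrt(2 / n)  # effect in sd units (Cohen's d)
    print(f"n = {n:4d}: observed effect must be {effects[n]:.2f} sd")

# Same p value at every n, yet the small experiment needed an effect
# roughly ten times stronger than the large one.
```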

Principle 3. In the words of Kolata’s expert, it’s “Bayes theorem”. He means consider other evidence — evidence from other studies. This is not only banal, it is meaningless: it is unclear — at least from what Kolata writes — how to weigh the various sources of evidence (what if the other evidence and the clinical trials disagree?).

Kolata also quotes David Freedman, a Berkeley professor of statistics who knew the cost of everything and the value of nothing. Perhaps it starts in medical school. As I blogged, working scientists, who have a clue, don’t want to teach medical students how to do research.

If this is the level of understanding of the people who do clinical trials, how much should we trust them? Presumably Kolata’s experts were better than average — a scary thought.

Will Like vs. Might Love vs. Might Hate

What to watch? Entertainment Weekly has a feature called Critical Mass: Ratings of 7 critics are averaged. Those averages are the critical response that most interests me. Rotten Tomatoes also computes averages over critics. It uses a 0-100 scale. In recent months, my favorite movie was Gran Torino, which rated 80 at Rotten Tomatoes (quite good). Slumdog Millionaire, which I also liked, got a 94 (very high).

Is an average the best way to summarize several reviews? People vary a lot in their likes and dislikes — what if I’m looking for a movie I might like a lot? Then the maximum (best) review might be a better summary measure; if the maximum is high, it means that someone liked the movie a lot. A score of 94 means that almost every critic liked Slumdog Millionaire, but the more common score of 80 is ambiguous: Were most critics a bit lukewarm or was wild enthusiasm mixed with dislike? Given that we have an enormous choice of movies — especially on Rotten Tomatoes — I might want to find five movies that someone was wildly enthusiastic about and read their reviews. Movies that everyone likes (e.g., 94 rating) are rare.

Another possibility is that I’m going to the movies with several friends and I just want to make sure no one is going to hate the chosen movie. Then I’d probably want to see the minimum ratings, not the average ratings.
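A toy example (made-up ratings, sketched in Python) shows how two movies with identical averages answer these questions very differently.

```python
import statistics

# Hypothetical ratings from 7 critics (0-100 scale) for two movies
# that an average alone cannot distinguish.
consensus_movie = [78, 80, 79, 81, 80, 82, 80]  # mild agreement
divisive_movie = [98, 95, 55, 97, 52, 96, 67]   # enthusiasm mixed with dislike

for name, ratings in [("consensus", consensus_movie),
                      ("divisive", divisive_movie)]:
    print(f"{name}: mean={statistics.mean(ratings):.0f} "
          f"max={max(ratings)} min={min(ratings)}")

# Both movies average 80, but the max answers "might I love it?" and
# the min answers "might someone in my group hate it?"
```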

So: different questions, wildly different “averages”. I have never heard a statistician or textbook make this point except trivially (if you want the “middle” number choose the median, a textbook might say).  The possibility of “averages” wildly different from the mean or median is important because averaging is at the heart of how medical and other health treatments are evaluated. The standard evaluation method in this domain is to compare the mean of two groups — one treated, one untreated (or perhaps the two groups get two different treatments).

If there is time to administer only one treatment, then we probably do want the treatment most likely to help. But if many treatments are available and there is time to administer more than one — if the first one fails, try another, and so on — then it is not nearly so obvious that we want the treatment with the best mean score. Given big differences from person to person, we might want to know which treatments worked really well for someone. Conversely, if we are studying side effects, we might want to know which of two treatments was more likely to have extremely bad outcomes. We would certainly prefer a summary like the minimum (worst) to a summary like the median or mean.

Outside of emergency rooms, there is usually both a wide range of treatment choices and plenty of time to try more than one. Suppose, for example, you want to lower your blood pressure. In such situations, extreme outcomes, even if rare, become far more important than averages. You want to avoid the extremely bad (even if rare) outcomes, such as antidepressants that cause suicide. And if a small fraction of people respond extremely well to a treatment that leaves most people unchanged, you want to know that, too. This is why medical experts who deride “anecdotal evidence” are like people trying to speak a language they don’t know — and don’t realize they don’t know. (Their cluelessness is enshrined in a saying: the plural of anecdote is not data.) Non-experts grasp this, I think: they are legitimately interested in anecdotal evidence, which does a better job than means or medians of highlighting extremes. It is the medical experts, who have read the textbooks but fail to understand their limitations, whose understanding has considerable room for improvement.