Andrew Gelman writes:
If I had to come up with one statistical tip that would be most useful to you–that is, good advice that’s easy to apply and which you might not already know–it would be to use transformations. Log, square-root, etc.–yes, all that, but more! I’m talking about transforming a continuous variable into several discrete variables (to model nonlinear patterns such as voting by age) and combining several discrete variables to make something [more] continuous (those “total scores” that we all love). And not doing dumb transformations such as the use of a threshold to break up a perfectly useful continuous variable into something binary. I don’t care if the threshold is “clinically relevant” or whatever–just don’t do it. If you gotta discretize, for Christ’s sake break the variable into 3 categories.
I agree (and wrote an article about it). Transforming data is so important that intro stats texts should have a whole chapter on it — but instead barely mention it. A good discussion of transformation would also include use of principal components to boil down many variables into a much smaller number. (You should do this twice — once with your independent variables, once with your dependent variables.) Many researchers measure many things (e.g., a questionnaire with 50 questions, a blood test that measures 10 components) and then foolishly correlate all independent variables with all dependent variables. They end up testing dozens of likely-to-be-zero correlations for significance. Thereby effectively throwing all their data away — when you do dozens of such tests, none can be trusted.
My explanation why this isn’t taught differs from Andrew’s. I think it’s pure Veblen: professors dislike appearing useful and like showing off. Statistics professors, like engineering professors, do less useful research than you might expect, so they are less aware than you might expect of how useful transformations are. And because most transformations don’t involve esoteric math, writing about them doesn’t allow you to show off.
In my experience, not transforming your data is at least as bad as throwing half of it away, in the sense that your tests will be that much less sensitive.