John Allen Paulos explains how the bell curve works. The bell curve, you will recall, is given by the equation

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}},$$

where $\mu$ is the average value of the variable (the peak of the bell curve) and $\sigma$ is the standard deviation. Paulos’s point is, apparently, that small differences in the values of $\mu$ and $\sigma$ can lead to extreme imbalances at the far ends of the curve (for large values of $x$). Here is how this might manifest itself in practice:

The corporation’s personnel officer notes the relatively small differences between the groups’ means and observes with satisfaction that the many mid-level positions are occupied by both Mexicans and Koreans. She is puzzled, however, by the preponderance of Koreans assigned to the relatively few top jobs, those requiring an exceedingly high score on the qualifying test. The personnel officer does further research and discovers that most holders of the comparably few bottom jobs, assigned to applicants because of their very low scores on the qualifying test, are Mexican.

She may suspect bias, but the result might just as well be an unforeseen consequence of the way the normal distribution works.

Yes, really. Of course, Paulos chose the direction of the imbalance at random. He says so right in the article.
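Paulos’s tail arithmetic is easy to reproduce. Here is a minimal sketch (my numbers, not his) comparing two normal curves with the same spread and with means 0.3 standard deviations apart:

```python
import math

def normal_tail(t, mu=0.0, sigma=1.0):
    """P(X > t) for X ~ N(mu, sigma^2), via the complementary error function."""
    return 0.5 * math.erfc((t - mu) / (sigma * math.sqrt(2)))

# Two hypothetical groups: identical spread, means shifted by 0.3 standard deviations.
for t in (1, 2, 3, 4):
    ratio = normal_tail(t, mu=0.3) / normal_tail(t, mu=0.0)
    print(f"beyond t = {t}: ratio of tail populations = {ratio:.2f}")
```

Beyond three standard deviations the higher-mean group is overrepresented by a factor of about 2.6, beyond four by about 3.4, and the ratio keeps growing as $t$ increases: a modest shift in the mean produces a large imbalance in a thinly populated tail.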

There’s a way of misusing mathematics that goes like this: start with a mathematical model, often a probability distribution or a differential equation, that looks reasonable enough in typical circumstances. Then assume a very specific set of circumstances, for example by making one of the parameters abnormally large; plug this into the general-purpose equation, manipulate it for a bit, and draw the conclusions. QED, or something.

What’s missing from this procedure is a level of mathematical maturity. In my experience of teaching undergraduate mathematics, manipulating exact formulas is the easy part for most students. (Relatively speaking, of course, but whatever.) The hard part is the inequalities, approximations, error estimates. You no longer have an exact equation that can be rearranged every which way and still remain equivalent to the original one. If you move around and rescale the terms in an approximate formula, the error might still be acceptable, or it might not be, and you can’t always tell which is which by just backtracking through an automated series of algebraic manipulations. You actually have to understand what’s going on.

The bell curve is given to us by the Central Limit Theorem as the asymptotic distribution of averages of large numbers of independent and identically distributed random variables. The words “limit”, “asymptotic”, “averages” and “large numbers” all refer to the basic principle that stochastic laws rely on having a huge number of measurements available for analysis. The larger the population, the more accurate the probabilistic description is likely to be on average. Bell curves are field artillery, not surgical scalpels. They’re best at describing the range from the middle out to the moderately extreme, which encompasses the vast majority of the population. Finding subtle patterns within the tiny samples at the far extremes is not what they’re made for.
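The “field artillery, not surgical scalpels” point can be checked numerically. In the toy example below (my own, chosen for convenience), an exact binomial distribution with $n = 50$ and $p = 0.1$ is approximated by a normal curve with the matching mean and standard deviation; the approximation is compared near the mean and nearly four standard deviations out.

```python
import math

n, p = 50, 0.1
mu = n * p                           # mean of the binomial: 5
sigma = math.sqrt(n * p * (1 - p))   # standard deviation: about 2.12

def binom_tail(k):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def normal_tail(k):
    """Normal approximation to P(X >= k), with continuity correction."""
    return 0.5 * math.erfc((k - 0.5 - mu) / (sigma * math.sqrt(2)))

for k in (5, 13):  # near the mean, and roughly 3.8 standard deviations out
    exact, approx = binom_tail(k), normal_tail(k)
    print(f"P(X >= {k}): exact {exact:.2e}, normal {approx:.2e}, "
          f"relative error {abs(approx - exact) / exact:.0%}")
```

Near the mean the normal curve is off by a few percent; in the far tail it underestimates the true probability roughly fivefold, because the binomial is skewed and the bell curve is not. Same distribution, same approximation; only the region of interest changed.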

Sure, if the population is sufficiently large – say, all residents of North America – then even the far ends of the bell curve may be populated well enough to allow stochastic analysis. But then we run into a different problem: can the quantity of interest be defined and measured uniformly across such vast and diverse populations? How do we measure, for example, “math ability” independently of the social and economic context? How do we quantify it in a way that’s accurate throughout the full spectrum of abilities, from the math illiterates to the Harvard research stars? Certainly the x-variable should be an increasing function of ability, but there are many increasing functions out there, and the image of a bell curve under a nonlinear change of variables is not a bell curve any longer. That’s before we even consider that the underlying assumption that “ability” is one-dimensional might well be too simplistic in the first place. It’s tempting to think of a single bell curve for every type of ability and to make fancy stochastic predictions based on that. That’s not mathematics, though. That’s wishful thinking.
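The point about nonlinear rescalings takes a few lines to verify. Below, a normally distributed score is pushed through the monotone (and entirely hypothetical) measurement scale $y = e^x$: the ordering of individuals is untouched, yet the rescaled distribution is heavily skewed.

```python
import math
import random
import statistics

random.seed(1)

# X ~ N(0, 1): scores on one hypothetical measurement scale.
x = [random.gauss(0, 1) for _ in range(100_000)]

# y = exp(x): the same individuals, same ranking, on a different monotone scale.
y = [math.exp(v) for v in x]

# On the original scale the distribution is symmetric: mean and median agree.
print(statistics.mean(x), statistics.median(x))
# On the rescaled axis they do not: the mean is dragged far above the median.
print(statistics.mean(y), statistics.median(y))
```

One histogram is a bell; the other is not. Which one is “the” distribution of ability depends entirely on an arbitrary choice of units.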

In the real world, few random variables are completely “independent” and “identically distributed”. In the context of investigating cognitive abilities, gender bias and other such topics, this is implicitly acknowledged by every author who draws different bell curves for different populations… but still, all individuals within each identified population are IID, right?

Actually, even when the random variables are differently distributed and somewhat correlated, we still expect the limiting distribution to look like the bell curve, on the grounds that the errors are randomly distributed and their contribution averages out to 0 in the limit. Here again, though, we need to work with really large numbers. We also need to know that the errors are not all skewed in one direction, as for example when an ethnic group or a population segment gets singled out disproportionately often for casual suggestions placing them closer to the stupid end of the bell curve.
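The averaging-out argument can be made concrete. In the sketch below (made-up numbers), symmetric zero-mean errors wash out as the sample grows, but a systematic skew, modeled here as a constant bias added to the noise, survives averaging completely intact:

```python
import random
import statistics

random.seed(2)
N = 100_000

# Symmetric, zero-mean errors: their average tends to 0 as N grows.
symmetric = [random.gauss(0, 1) for _ in range(N)]

# Errors skewed in one direction (a constant -0.5 bias on top of the noise,
# e.g. one group being systematically scored lower): the bias never averages out.
skewed = [random.gauss(0, 1) - 0.5 for _ in range(N)]

print(statistics.mean(symmetric))  # near 0
print(statistics.mean(skewed))     # near -0.5, no matter how large N gets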

But the main deal breaker is conditioning on extremely low probability events. This doesn’t just introduce errors into the general purpose probability distribution; it makes that distribution about as relevant as last year’s snow. (Think zero-measure sets or lower-dimensional submanifolds.) If you think that getting to the top in management, academia or any other hierarchical profession depends only on “abilities”, then I have a bridge that I could sell you real cheap. There are events and factors that we’re conditioning on before we even get anywhere close to the top. There must be studies out there that actually try to get at some of these factors and quantify their influence. That would be messy and complicated, though. It doesn’t have the popular appeal of the claim that career success is determined by a simple universal formula with two parameters.
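Here is a toy simulation of the conditioning problem. Everything in it is made up: “skill” and “luck” are hypothetical standard normal variables, and “success” is a weighted sum in which skill carries most of the weight. Yet the small non-skill component dominates who actually lands in the top 0.1%.

```python
import random

random.seed(3)
N = 100_000

skill = [random.gauss(0, 1) for _ in range(N)]
luck = [random.gauss(0, 1) for _ in range(N)]

# Success is mostly skill (weight 0.8, about 94% of the variance) plus a small
# residual of everything else: connections, timing, plain luck (weight 0.2).
success = [0.8 * s + 0.2 * l for s, l in zip(skill, luck)]

top = sorted(range(N), key=lambda i: success[i], reverse=True)[:100]
top_by_skill = set(sorted(range(N), key=lambda i: skill[i], reverse=True)[:100])

avg_luck_at_top = sum(luck[i] for i in top) / len(top)
overlap = len(top_by_skill.intersection(top))
print(f"average luck among the top 100 by success: {avg_luck_at_top:.2f}")
print(f"top 100 by success who are also top 100 by skill: {overlap}")
```

The average “luck” score among the top 100 out of 100,000 sits far above the population average of 0, and the top-100-by-success and top-100-by-skill lists disagree substantially. Conditioning on the rare event “made it to the top” drags in the residual factors, even though they explain only a few percent of the overall variance.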

Let me tell you about a layman’s version of Szemeredi’s theorem that I like: given any desired statistical pattern, one can always find data to support it, provided only that a large enough pool of data is available to pick and choose from. It shouldn’t come as a surprise that all known (to me) proofs of Szemeredi’s theorem involve conditioning on unlikely events at some point. Szemeredi’s theorem doesn’t take sides, so if you’re looking for data to contradict the same pattern instead, you should be able to find that as well. In either case, you’ll be ignoring the vast majority of the data available to you.
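The cherry-picking half of this is easy to demonstrate with a simulation (an illustration of selective sampling, to be clear, not of Szemeredi’s theorem itself). The two groups below are drawn from the identical distribution, yet a motivated analyst can extract “evidence” for either group’s superiority from the very same data set:

```python
import random
import statistics

random.seed(4)

# Two groups drawn from the *same* underlying distribution.
group_a = [random.gauss(100, 15) for _ in range(10_000)]
group_b = [random.gauss(100, 15) for _ in range(10_000)]

def cherry_pick(favored, other, k=50):
    """Keep the top k of the favored group and the bottom k of the other."""
    return statistics.mean(sorted(favored)[-k:]), statistics.mean(sorted(other)[:k])

a_high, b_low = cherry_pick(group_a, group_b)
print(f"'A is superior': selected A mean {a_high:.0f} vs selected B mean {b_low:.0f}")

b_high, a_low = cherry_pick(group_b, group_a)
print(f"'B is superior': selected B mean {b_high:.0f} vs selected A mean {a_low:.0f}")
```

Each “study” keeps 100 of the 20,000 available data points and throws away the rest; both opposite conclusions are manufactured from the same pool.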


I think you mean \sigma is the standard deviation (or else \sigma^2 is the variance).

You’re right, thanks. I’ve corrected that.