How frequently do you see in the headlines that scientists have discovered that tomato juice reduces the chances of Parkinson’s disease, that red wine does or does not reduce the risk of heart disease, or that salmon is good for your brain? While statements like these may well be true, they tend to reach us as a random collection of disconnected findings, each assessed using standard statistical tools.

Of course, therein lies at least one major rub inherent to this piecemeal approach. If I come up with twenty newsworthy illnesses and then devise a clinical trial for each to assess the effectiveness of some substance for fighting it, I am highly likely to come up with at least one statistically significant result. This is in fact true even if the substance I am providing does absolutely nothing. While the placebo effect could account for some of this, the more important reason is much more basic:

Statistical evaluation in clinical trials is done using a method called hypothesis testing. Let’s say I want to evaluate the effect of pomegranate juice on memory. I come up with two groups of volunteers and some kind of memory test, then give the juice to half the volunteers and an indistinguishable placebo to the others. Then, I give out the tests and collect scores. Now, it is possible that – entirely by chance – one group will outperform the other, even if they are both randomly selected and all the trials are done double-blind. As such, what statisticians do is start with the hypothesis that pomegranate juice does nothing: this is called the null hypothesis. Then, you look at the data and ask how likely it is that you would have gotten the data you did if pomegranate juice really does nothing. The less likely the data are under that assumption, the more confident you can be in rejecting the null hypothesis.
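To make this concrete, here is a quick simulation sketch of such a trial. All the numbers – group sizes, score means, the spread – are invented for illustration, and I’ve used a simple permutation test in place of the formulas a statistics package would apply:

```python
import random

random.seed(42)

# Hypothetical memory-test scores for 50 juice and 50 placebo volunteers.
# The means (72 vs. 70) and spread (10) are made up for this sketch.
juice = [random.gauss(72, 10) for _ in range(50)]
placebo = [random.gauss(70, 10) for _ in range(50)]

observed = sum(juice) / 50 - sum(placebo) / 50

# Permutation test: if the null hypothesis is true, the group labels are
# arbitrary, so shuffle them and count how often chance alone produces a
# gap at least as large as the one we observed.
pooled = juice + placebo
trials = 10_000
extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    a, b = pooled[:50], pooled[50:]
    if abs(sum(a) / 50 - sum(b) / 50) >= abs(observed):
        extreme += 1

p_value = extreme / trials
print(f"observed difference: {observed:.2f}, p-value: {p_value:.3f}")
```

A small p-value here would mean that a gap this large rarely arises by shuffling alone, which is exactly the "how likely is this data under the null" question described above.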

If, for instance, we gave this test to two million people, all randomly selected, and the ones who got the pomegranate juice did twice as well in almost every case, it would seem very unlikely that pomegranate juice has no effect. The question, then, is where to set the boundary between data that is consistent with the null hypothesis and data that allows us to reject it. For largely arbitrary reasons, the confidence level is usually set at 95%. That means we only reject the null hypothesis when there is a chance of 5% or less that we would have gotten data this extreme if the null hypothesis were true – that is, if pomegranate juice does nothing.

More simply, let’s imagine that we are rolling a die and trying to evaluate whether it is fair or not. If we roll it twice and get two sixes, we might be a little bit suspicious. If we roll it one hundred times and get all sixes, we will become increasingly convinced the die is rigged. It’s always possible that we keep getting sixes by random chance, but the probability falls with each additional piece of data we collect that indicates otherwise. The number of trials we do before we decide that the die is rigged is the basis for our confidence level.
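The die example is simple enough to work out exactly: the chance that a fair die produces n sixes in a row is (1/6)^n, which shrinks fast. A short sketch:

```python
from fractions import Fraction

# Probability that a fair die shows n sixes in a row: (1/6)**n.
for rolls in (2, 5, 10):
    p = Fraction(1, 6) ** rolls
    print(f"{rolls:>2} sixes in a row: {float(p):.2e}")

# How many consecutive sixes before that probability drops below 5%?
n, p = 0, Fraction(1)
while p >= Fraction(5, 100):
    n += 1
    p *= Fraction(1, 6)
print(f"{n} sixes in a row already fall below the 5% threshold")
```

Two sixes in a row have probability 1/36, or about 2.8% – already under the conventional 5% cutoff, which is why even a pair of sixes is enough to make a strict 95%-confidence tester suspicious.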

The upshot of this, going back to my twenty diseases, is that if you do these kinds of studies over and over again, you will incorrectly identify a statistically significant effect 5% of the time.[1] Because that’s the confidence level you have chosen, you will always get that many false positives (instances where you identify an effect that doesn’t actually exist). You could set the confidence level higher, but that requires larger and more expensive studies. Indeed, moving from 95% confidence to 99% or higher can often require increasing the sample size one hundred-fold or more. That is cheap enough when you’re rolling dice, but it gets extremely costly when you have hundreds of people being experimented upon.
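To see how quickly those false positives pile up across twenty studies, here is a small simulation. The “studies” are fabrications in which the null is true by construction (each p-value is uniform on [0, 1]), so every significant result is a false positive:

```python
import random

random.seed(1)

# Simulate many batches of twenty studies of a substance that truly does
# nothing.  Under the null, each study's p-value is uniform on [0, 1].
batches = 10_000
hits = 0
for _ in range(batches):
    pvals = [random.random() for _ in range(20)]
    # Did at least one study come out "significant" at the 95% level?
    if any(p < 0.05 for p in pvals):
        hits += 1

print(f"batches with at least one false positive: {hits / batches:.2f}")
# Compare with the exact value, 1 - 0.95**20.
print(f"exact probability: {1 - 0.95 ** 20:.2f}")
```

Roughly two times in three, at least one of the twenty do-nothing studies will look newsworthy.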

My response to all of this is to demand the presence of some comprehensible causal mechanism. If we test twenty different kinds of crystals to see if adhering one to a person’s forehead helps their memory, we should expect about one in twenty to appear to work, at a 95% confidence level – even though we have no reasonable scientific explanation of why any of them should. Whenever we have a statistically established correlation but no causal understanding, we should be cautious indeed. Of course, it’s difficult to learn these kinds of things from the sort of news story I was describing at the outset.

[1] If you’re interested in the mathematics behind all of this, just take a look at the first couple of chapters of any undergraduate statistics book. As soon as I broke out any math here, I’d be liable to scare off the kind of people who I am trying to teach this to – people absolutely clever enough to understand these concepts, but who feel intimidated by them.

Comments

Null hypothesis testing has always been controversial. Many statisticians have pointed out that rejecting the null hypothesis says little or nothing about the likelihood that the null is true. Under traditional null hypothesis testing, the null is rejected when P(Data | Null) is very small, say below 0.05. However, researchers are really interested in P(Null | Data), which cannot be inferred from a p-value. In some cases, P(Null | Data) approaches 1 while P(Data | Null) approaches 0 – in other words, we can reject the null even when it is virtually certain to be true. For this and other reasons, Gerd Gigerenzer has called null hypothesis testing “mindless statistics”, while Jacob Cohen describes it as a ritual conducted to convince ourselves that we have the evidence needed to confirm our theories.
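A toy Bayesian calculation makes the gap between the two probabilities vivid. All the numbers here are invented: suppose the null is almost certainly true a priori (think crystal therapy), and a real effect, if it existed, would produce data this extreme half the time:

```python
# Invented numbers for illustration only.
p_null = 0.99              # prior: the null is very likely true
p_data_given_null = 0.05   # a "significant" result under the null
p_data_given_alt = 0.50    # chance of such data if the effect were real

# Bayes' theorem: P(Null | Data) = P(Data | Null) P(Null) / P(Data).
p_data = p_data_given_null * p_null + p_data_given_alt * (1 - p_null)
p_null_given_data = p_data_given_null * p_null / p_data

print(f"P(Data | Null) = {p_data_given_null:.2f}")
print(f"P(Null | Data) = {p_null_given_data:.2f}")
```

Here the result is “significant” at the usual threshold, yet the null remains overwhelmingly likely to be true – about 91% – which is exactly the distinction the comment above is drawing.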

The statistician Francis Anscombe notes that “Tests of the null hypothesis that there is no difference between certain treatments are often made in the analysis of agricultural or industrial experiments in which alternative methods or processes are compared. Such tests are […] totally irrelevant. What are needed are estimates of magnitudes of effects, with standard errors.”

(Source)

Also, a very mathematical parallel to your die example.

If you are out hunting for ideas to debunk, I must recommend having a look at Greenie Watch, a blog written by Jon Ray (related to Gene, perhaps?)

To quote: “This site is in favour of things that ARE good for the environment. Most Greenie causes are, however, at best red-herrings and are more motivated by a hatred of people than anything else.

John Ray (M.A.; Ph.D.), writing from Brisbane, Australia”

That seems to go well with this: Pandas are endangered because they are utterly incompetent.

Statistics are, as the French say, “le boring!”

Check out what’s in the news today: Pomegranate Juice May Keep Prostate Cancer at Bay

There’s a TV shampoo advert at the moment – I forget what brand – claiming to leave hair “up to 100% dandruff free”. Still, at least they didn’t claim 300% or something…

If journalists frequently misuse statistics, advertisers have made a positive art of twisting them in the oddest of ways.

Dartmouth is giving away the much-praised textbook, Introduction to Probability by Charles M. Grinstead and J. Laurie Snell, as a free etext. The website also includes computer programs to go along with the book.

Heads!

Heads.

Heads…

Precisely

On overfitting

Of course, the parabola looked like it might be a better fit — after all, it is closer to more of the data points, which really do seem to be curving upwards. And we couldn’t have known in advance, could we? Well, we could have improved our chances of being right by using some basic statistics (a chi-squared test, for example). The results would have shown us that there were not enough degrees of freedom in the data to justify the use of the higher-order curve (which reduces the degrees of freedom from 8 to 7 in this example). Choosing the parabola in this case is a classic example of overfitting.

The basic lesson here is that one should avoid using more parameters than necessary when fitting a function (or functions) to a set of data. Doing otherwise, more often than not, leads to large extrapolation errors.
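A quick sketch of the same lesson, assuming (for illustration) data that is genuinely linear plus noise, and using an exact-fit polynomial as the extreme case of too many parameters rather than a parabola:

```python
import random

random.seed(0)

# Nine points from a truly linear process y = 2x + 1, plus noise.
xs = list(range(9))
ys = [2 * x + 1 + random.gauss(0, 0.5) for x in xs]

# Honest model: a straight line fitted by ordinary least squares.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
    (x - mx) ** 2 for x in xs
)
intercept = my - slope * mx

def overfit(x):
    """Degree-8 polynomial through all nine points exactly (Lagrange
    interpolation) - zero residuals, zero degrees of freedom left."""
    total = 0.0
    for i in range(n):
        term = ys[i]
        for j in range(n):
            if j != i:
                term *= (x - xs[j]) / (xs[i] - xs[j])
        total += term
    return total

# Both fits look fine inside the data; extrapolation tells them apart.
x_new = 15
true_y = 2 * x_new + 1
line_err = abs(slope * x_new + intercept - true_y)
poly_err = abs(overfit(x_new) - true_y)
print(f"line extrapolation error at x={x_new}:      {line_err:.2f}")
print(f"degree-8 extrapolation error at x={x_new}:  {poly_err:.2f}")
```

Within the range of the data the degree-8 curve is a perfect fit; a short distance beyond it, the noise it so faithfully memorized blows up into an enormous prediction error, while the straight line stays close to the truth.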