Taking the P out of Psychology

Earlier this week the American Psychological Association issued a statement clarifying some basic facts of psychology that statisticians and other scientists keep getting wrong.

No, wait, I might have that backwards.  It was the American Statistical Association that issued the statement, comprising six principles about the use of p values. It was aimed at scientists (social or otherwise) who all bristled with irritation at the ASA's condescension until they got to number two on the list and surreptitiously Googled the difference between "the probability of the data given the hypothesis" and "the probability of the hypothesis given the data".

All those whose scientific livelihood depends on producing p values below .05 are aware of the craziness of the situation. A study supported by a statistic with a p value of .049 is lauded as a breakthrough worthy of publication. But a p value of .051 means you're a terrible scientist churning out junk. Worryingly, this is not a huge exaggeration. Flick through a contemporary psychology journal and try to find a results section that doesn't include the magic p<.05.

When genius eugenicist Sir Ronald Fisher published Statistical Methods for Research Workers in 1925 he would have been astonished to learn that thousands of people nearly a century later would still be slavishly following his rules of thumb.

Ronald Fisher with a faraway look in his eye. Or has he spotted someone with a lesser "innate capacity for intellectual and emotional development" ?

And rules of thumb they were. He considered a critical p value of .05 to be sensible, given that it was about two standard deviations from the mean (in a 2-tailed test with a normal distribution). But he was also clear that while this was a "test of significance", what that actually meant was that if replications of the study achieved similar p values, you were probably on to something.
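The "about two standard deviations" link is easy to check for yourself. Here is a minimal sketch using only Python's standard library; the variable names are mine, and it assumes a standard normal distribution and a 2-tailed test, as above.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Critical z for a 2-tailed test at p = .05: 2.5% in each tail.
z_crit = std_normal.inv_cdf(1 - 0.05 / 2)

# Going the other way: the 2-tailed p value at exactly two standard deviations.
p_at_two_sd = 2 * std_normal.cdf(-2.0)

print(f"critical z for 2-tailed p = .05: {z_crit:.3f}")
print(f"2-tailed p at two standard deviations: {p_at_two_sd:.4f}")
```

The critical value comes out a shade under 2, and two standard deviations correspond to a p value a shade under .05, which is why the two are treated as roughly interchangeable.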

To implement Fisher's method you would calculate a statistic and then consult the tables at the back of his book to see if your value was greater than the one given for a p value of .05. If it was, you could celebrate by writing p<.05 in your lab report, along with the superfluous "and thus we may reject the null hypothesis".

The reason for using the term p<.05 was that you didn't know the exact value of p, because that would have been too difficult to calculate with only a slide rule and ten fingers, unless your name was Ronald Fisher. This was acceptable at the time of Downton Abbey. It was understandable even 50 years ago when statistical computing machines were operated with punchcards by earnest, pipe-smoking men. For bizarre reasons it was still the practice 25 years ago when I learned statistics the hard way because psychologists didn't yet trust new-fangled computer programs like SPSS (now approaching its 50th birthday). But in 2016 it's absurd.

Can we all agree to dispense with p<.05?

And once we've thrown out that bathwater, we can throw out the baby of pass/fail hypothesis testing. We still teach students that if p<.05 they can happily reject their null hypothesis. Instead we should be teaching them how to accumulate and evaluate different types of evidence that support a hypothesis. Statistical probabilities are one part of that evidence. But so are the magnitudes of effect sizes. And how we interpret these depends on what's being measured.

We pooh-pooh even the idea of rules of thumb for the magnitude of correlation coefficients or effect sizes. I have a slide I use in statistics lectures which states that a correlation coefficient between 0.1 and 0.3 is "small", between 0.3 and 0.5 is "medium" and greater than 0.5 is "large". I then wait for the students to note this down before an overwrought PowerPoint animation crosses this out, replacing it with the po-faced statement "There is no objective interpretation of coefficient magnitude; it depends on the context". How they laugh.

The special treatment of p values is evident in statistical software like SPSS which will happily annotate tables by adding asterisks in varying number to indicate that the p value it has calculated to dozens of decimal places is below an arbitrary threshold.

If you calculate a vast matrix of correlation coefficients you can easily spot those that are "statistically significant", regardless of whether the correlations themselves are large. Prioritising significance over magnitude leads to misleading research. For example, a study of the effects of violent video games on teenagers' behaviour showed a statistically significant correlation between video game playing and engaging in physical fights. This correlation was so significant, it had three asterisks next to it.

By contrast, the correlation itself (.21) was so small that less than 5% of the variance in physical fights was accounted for by video game playing. At this point, most people would be thinking so what accounts for the other 95%? Instead, the authors of the paper were thinking yay, we got three asterisks.
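To see how a correlation of .21 can be both tiny and heavily asterisked, here is a rough sketch. The sample sizes are hypothetical (I haven't given the study's n here), the function names are mine, and the p value uses the Fisher z-transformation, which is a large-sample approximation rather than the t test a stats package would normally report.

```python
import math
from statistics import NormalDist


def variance_explained(r: float) -> float:
    """Proportion of variance in one variable accounted for by the other."""
    return r * r


def approx_p_for_r(r: float, n: int) -> float:
    """Approximate 2-tailed p value for the null hypothesis of zero correlation.

    Uses the Fisher z-transformation: atanh(r) is roughly normal with
    standard error 1 / sqrt(n - 3). This is an approximation.
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * NormalDist().cdf(-abs(z))


r = 0.21
print(f"variance explained by r = {r}: {variance_explained(r):.1%}")

# Hypothetical sample sizes, chosen only to show the trend.
for n in (50, 200, 1000):
    print(f"n = {n:>4}: approximate p = {approx_p_for_r(r, n):.2g}")
```

The correlation never changes, and neither does the share of variance it explains. Only the sample size does, and with enough teenagers the asterisks take care of themselves.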

I propose that if we are to continue with the concept of statistical significance in relation to the probability of the data given the null hypothesis, we should start using the concept of statistical magnificence in relation to correlation coefficients (or indeed effect sizes in general).

A correlation greater than about 0.7 should be considered statistically magnificent, because it implies that one variable accounts for most of the variance (>50%) in the other.
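For the sticklers, the "about 0.7" comes from the square root of one half. A small sketch, in which the function names and the 50% default are my own assumptions rather than any established convention:

```python
import math


def min_r_for_variance(share: float) -> float:
    """Smallest absolute correlation that accounts for the given share of variance."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    return math.sqrt(share)


# The threshold proposed above: more than half the variance accounted for.
MAGNIFICENCE_THRESHOLD = min_r_for_variance(0.5)


def is_magnificent(r: float, threshold: float = MAGNIFICENCE_THRESHOLD) -> bool:
    """True if the correlation, positive or negative, exceeds the threshold."""
    return abs(r) > threshold


print(f"threshold: {MAGNIFICENCE_THRESHOLD:.4f}")
for r in (0.21, 0.70, -0.80):
    print(f"r = {r:+.2f}: magnificent = {is_magnificent(r)}")
```

Note that 0.7 exactly falls just short, since it accounts for 49% of the variance. Hence "about".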

And we should do away with asterisks to indicate levels of significance. Instead, we should have different symbols to indicate significance, magnificence and anything else to which our attention should be drawn.

I am confident the next version of SPSS will produce output like this:

SPSS Correlation Matrix

Alternatively, we could read the ASA guidelines and learn how to use probability, effect size and statistical power sensibly.