You won’t get very far into any journal before you start reading about statistical significance, and its close sibling, the 95% confidence interval. But what do these terms mean, and how do they help us draw conclusions about studies?
Because this topic can be numbingly dry in the abstract, we’ll illustrate these basic concepts by reviewing the results of a paper chosen from the current issue of the American Journal of Psychiatry (Eranti S et al, A Randomized, Controlled Trial With 6-Month Follow-Up of Repetitive Transcranial Magnetic Stimulation and Electroconvulsive Therapy for Severe Depression, Am J Psychiatry 2007;164:73-81).
From the title alone, you can ascertain that the study design is one of the better ones, that is, a randomized, controlled trial comparing ECT with rTMS for the treatment of severe depression. In this study, a total of 46 patients with severe depression were randomly assigned to either a 15-day course of rTMS (N=24) or a standard course of ECT (N=22). If you peer into the “Method” section, however, you will discover that it is neither double-blinded nor placebo-controlled. Not a perfect design, but a pretty good one.
Next, let’s go directly to the “Results” section (page 75), focusing specifically on the subheading “Primary Outcome:”
“Post hoc tests showed that the end-of-treatment HAM-D scores were significantly lower in the ECT group than in the rTMS group (F=10.89, df=1, 45, 95% CI for difference=3.40 to 14.05, p=0.002), demonstrating a strong standardized effect size of 1.44.”
It’s a mouthful, to be sure, but with some basic concepts in statistics you’ll be able to whip through such verbiage in no time. Let’s deconstruct the findings.
1. “Post hoc tests showed that the end-of-treatment HAM-D scores were significantly lower in the ECT group than in the rTMS group (F=10.89, df=1, 45, 95% CI for difference=3.40 to 14.05, p=0.002)….”
This means that statistics done after the results were tallied (“post hoc”) showed that patients who received ECT ended up with an average Hamilton depression score that was lower (meaning less depressed) than those patients who received rTMS. The stuff in parentheses is there to prove that this difference was “statistically significant.” Skip to the end of those numbers, and you see that “p=0.002.” Translation: the probability that this result might have occurred by chance alone (and therefore is not a “real” finding) is 2 out of 1000, or 0.002, or only 0.2%. The standard cut-off point for statistical significance is p=0.05, or a 5% probability that the results occurred by chance, so the results of this study are particularly “robust.”
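If it helps to see that arithmetic spelled out, here is a minimal Python sketch of the decision rule just described; the p value is the one reported in the extract, and 0.05 is the conventional cut-off.

```python
# Minimal illustration of the significance decision rule described above.
p_value = 0.002   # p value reported in the extract
alpha = 0.05      # conventional cut-off for statistical significance

print(f"Probability of this result arising by chance alone: {p_value * 100:.1f}%")
print("Statistically significant" if p_value < alpha else "Not significant at the 0.05 level")
```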
You will often see studies in which results are reported like this: “the difference between Drug A and Drug B showed a trend toward statistical significance (p=0.06).” This means that the results didn’t quite meet the crucial 0.05 threshold, but they came close. Why is 5% the magic number? As befits an arbitrary number, its choice was also somewhat arbitrary. In 1926, R. A. Fisher, one of the fathers of modern statistics, wrote an article in which he argued that it was “convenient” to choose this cut-off point, for a variety of reasons having to do with standard deviations and the like (for more information, see Dallal GE, The Little Handbook of Statistical Practice, posted on the web at http://www.tufts.edu/~gdallal/LHSP.HTM). This number has stood the test of time throughout all the scientific disciplines. Why? Because it has some intuitive appeal.
Look at it this way: Before we accept a finding as scientific fact, we want to be pretty certain that it didn’t occur through some coincidence of random factors. But how certain is “pretty certain?” Would 80% certainty (p=0.2) be enough for you? Probably not. Most doctors would not feel comfortable basing important treatment decisions on only an 80% certainty that a treatment is effective. Much better would be 99% certainty (p=0.01), but if that were the required threshold we would have very little to offer our patients. It just so happens that 95% certainty has felt “right” to scientists through the last 50 years or so. Of course it’s arbitrary, but if we don’t agree on some threshold, we open ourselves up to researchers creating their own threshold values depending on how strongly they want to push acceptance of their data (some still do this anyway). Because the scientific community has settled upon p=0.05, the term “statistical significance” has a certain, well, significance!
That being said, you, as a reader and clinician, have every right to look at a study reporting p=0.06 and say to yourself, “There’s only a 6/100 chance that this was a coincidental finding. It may not meet the 0.05 threshold, but, at least in this clinical situation, that’s good enough for me, so I think I’ll try this treatment.”
What about those other numbers? “F=10.89” means that the “F value” is 10.89. The F value is computed from the difference between the HAM-D scores in the two treatment groups (it’s a bit more involved than this, because this difference is divided by a factor to correct for variation in individual scores, but for purposes of basic understanding, we don’t need to get into that). Clearly, the higher the F-value, the more of a difference there is between the groups, and the more likely it is that this difference will be statistically significant.
You’ll often see this kind of statistic referred to as an “analysis of variance,” and now you can see why it’s called that: it’s an analysis of the variance, or difference, between the averages of the two treatment groups.
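For readers who like to see the machinery, here is a minimal sketch of a one-way analysis of variance in Python using scipy. The HAM-D scores below are invented for illustration and are not the trial’s data; the point is simply that the procedure returns the F value and p value discussed above.

```python
from scipy import stats

# Hypothetical end-of-treatment HAM-D scores (illustrative only, NOT the trial's data)
ect_scores  = [8, 10, 7, 12, 9, 11, 6, 10]
rtms_scores = [18, 15, 20, 17, 16, 19, 14, 18]

# One-way analysis of variance: compares the variation between the group means
# with the variation of individual scores within each group.
f_value, p_value = stats.f_oneway(ect_scores, rtms_scores)

print(f"F = {f_value:.2f}, p = {p_value:.4f}")
print("Significant at p < 0.05" if p_value < 0.05 else "Not significant at p < 0.05")
```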
The “df” in the extract means “degrees of freedom,” an arcane statistical term. The first number equals the number of treatment groups minus one; the second reflects the number of patients in the study. Believe me, you don’t want to know more than this.
What about the “95% CI for difference= 3.40 to 14.05”? This refers to the 95% confidence interval for the difference in end-of-treatment HAM-D scores between the two treatment groups (not for the F value itself). This means that we have 95% confidence that the actual difference in HAM-D scores between the two groups is somewhere between 3.40 and 14.05. That’s a large range, to be sure, but the key point here is that we’re 95% certain that the difference is no less than 3.4. And there’s a good chance that the difference is more than that.
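The paper’s confidence interval comes from its own statistical model, but the general recipe for a 95% confidence interval around a difference between two group means looks roughly like this sketch (again with invented scores, using a standard pooled two-sample t interval):

```python
import numpy as np
from scipy import stats

# Hypothetical end-of-treatment HAM-D scores (illustrative only)
ect  = np.array([8, 10, 7, 12, 9, 11, 6, 10], dtype=float)
rtms = np.array([18, 15, 20, 17, 16, 19, 14, 18], dtype=float)

n1, n2 = len(ect), len(rtms)
diff = rtms.mean() - ect.mean()   # difference between the group means

# Pooled variance and the standard error of the difference
pooled_var = ((n1 - 1) * ect.var(ddof=1) + (n2 - 1) * rtms.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Two-sided 95% critical value from the t distribution
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)

lower, upper = diff - t_crit * se, diff + t_crit * se
print(f"Difference in mean HAM-D = {diff:.2f}, 95% CI = {lower:.2f} to {upper:.2f}")
```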
2. “…demonstrating a strong standardized effect size of 1.44.”
Knowing that the apparent advantage of ECT over rTMS in these patients was statistically significant is all well and good. But how do we get a handle on measuring how strong this advantage was? This is where “effect size” comes into play. The effect size is a standardized measure of how large the difference between the two treatments actually is. To calculate it, you divide the difference in the outcome measure between the two treatment groups by the standard deviation. (Sorry, I’m not going to define standard deviation, since understanding it is not crucial for a basic understanding of effect size).
If the effect size is 0, this implies that the mean score for the treatment group was the same as the comparison group, ie, no effect at all. And just as obviously, the higher the effect size, the stronger the effect of treatment. Here are the standard benchmarks: effect sizes of 0 to 0.3 represent little to no effect, 0.3 to 0.6 a small effect, 0.6 to 0.8 a moderate effect and 0.8 or greater a strong effect. As you can see, the effect size in this study, 1.44, was very strong, meaning that ECT was strongly superior to rTMS in these patients.
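Here is the corresponding arithmetic as a short sketch: a Cohen’s-d-style standardized effect size, dividing the difference in mean scores by a pooled standard deviation and then matching it against the benchmarks above. The scores are invented for illustration.

```python
import numpy as np

# Hypothetical end-of-treatment HAM-D scores (illustrative only)
ect  = np.array([8, 10, 7, 12, 9, 11, 6, 10], dtype=float)
rtms = np.array([18, 15, 20, 17, 16, 19, 14, 18], dtype=float)

# Pooled standard deviation of the two groups
n1, n2 = len(ect), len(rtms)
pooled_sd = np.sqrt(((n1 - 1) * ect.var(ddof=1) + (n2 - 1) * rtms.var(ddof=1)) / (n1 + n2 - 2))

# Effect size: difference in group means divided by the pooled standard deviation
effect_size = abs(rtms.mean() - ect.mean()) / pooled_sd

# Benchmarks as given in the text
if effect_size >= 0.8:
    label = "strong"
elif effect_size >= 0.6:
    label = "moderate"
elif effect_size >= 0.3:
    label = "small"
else:
    label = "little to no effect"

print(f"Standardized effect size = {effect_size:.2f} ({label})")
```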
By the way, resist the temptation to make up your mind after reading a single study. In this case, it turns out that several other studies have compared rTMS and ECT; some have replicated the findings of this study (Janicak PG, et al, Biol Psychiatry 2002;51:659-667), while others have reported that rTMS is just as good as ECT (Grunhaus L, Biol Psychiatry 2000;47:314-324). Authors usually discuss such discrepancies in the discussion section, and the usual explanation is that the other studies were deficient in some way!