Christina here…

Today I shall science you about research literature and the role measures of **effect size** play in judging how large the effect of a researched treatment modality really is.

I’ll try to present this information for the lay reader interested in science and in learning how to read and interpret research literature. This might serve as a useful skill, especially if your interests include skepticism of alternative medicine.

The purpose of research, especially medical research, is to determine if a particular variable (modality, treatment, drug, therapy, etc) has a clinically meaningful (i.e. worthwhile and useful) effect. However, a common misconception among researchers and lay people alike is that if the outcome of a study is statistically significant, then one can glean importance or meaning from said research.

The problem: statistical significance is insufficient for determining clinical relevance (in simplest terms: *whether or not it works*). For that, we need another measure, namely *effect size*.

When people speak of research, “significance” gets frequently batted around as though the word means something like, “the quality of being important or great”.

Here’s a hypothetical research paper abstract result: “Clover significantly more effective than placebo at reducing depression”.

To a lay reader, the use of the term “significant” denotes a great difference, which is why research titles like the one above often get translated as, “Clover cures depression!”

But tests of significance don’t tell us if an effect is trivial or massive. They don’t tell us the magnitude of the effect. For that, we need to know the effect size. Unfortunately, the gold standard for worthwhile publication in many scientific journals is statistical significance, i.e. a P-value of less than 0.05. Citrome, L (2011) referred to this as the “Tyranny of the P”, as many journals use this as their main measure of publication worthiness, especially lower-impact journals.

A lot of published research (especially of the alternative-medicine variety) will report statistically significant differences, but will not state the magnitude of the differences. Researchers may find a statistically significant difference between the two treatment groups, but that difference might be too small to have any clinically relevant effect that would warrant using the alternative medicine as a treatment.

*Effect size* is a measure of the strength of the relationship between two variables. In scientific experiments, in order to determine clinical relevance, we must know not only whether an experiment has a statistically significant effect, but also the size of any observed effects. In other words, if we look at a group of 3,000 people in a treatment group and 3,000 people in a placebo group, we might find significant differences, even when those differences are essentially meaningless or trivial. *How much* these groups differ, or *how large* the differences are, carries much greater weight. With a large enough sample, researchers can usually show a significant difference between two groups, unless those groups are 100% identical. The important part is to what *degree* the groups are different. For this, you need to know the effect size. To shamelessly quote the Wikipedia entry on effect size:

Presentation of effect size and confidence interval is highly recommended in biological journals. Biologists should ultimately be interested in biological importance, which can be assessed using the magnitude of an effect, not statistical significance. Combined use of an effect size and its confidence interval enables someone to assess the relationship within data more effectively than the use of p values, regardless of statistical significance. Also, routine presentation of effect size will encourage researchers to view their results in the context of previous research and will facilitate incorporating results into future meta-analysis. However, issues surrounding publication bias towards statistically significant results, coupled with inadequate statistical power will lead to an overestimation of effect sizes, consequently affecting meta-analyses and power-analyses.
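To make the sample-size point concrete, here is a minimal Python sketch. It assumes two equal-sized groups and a normal approximation, under which the test statistic is roughly the standardized mean difference multiplied by the square root of half the group size. The function name `approx_two_sided_p` and the tiny difference of 0.1 standard deviations are my own illustration, not data from any study.

```python
import math

def approx_two_sided_p(std_diff, n_per_group):
    """Approximate two-sided p-value for a standardized mean difference
    between two equal-sized groups (normal approximation, an assumption)."""
    z = std_diff * math.sqrt(n_per_group / 2)
    # erfc(|z| / sqrt(2)) equals the two-sided tail area of a standard normal
    return math.erfc(abs(z) / math.sqrt(2))

tiny_difference = 0.1  # hypothetical: one tenth of a standard deviation

for n in (50, 500, 3000):
    p = approx_two_sided_p(tiny_difference, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n per group = {n:5d}  p = {p:.4f}  ({verdict})")
```

The difference between the groups is the same trivial size on every row. Only the number of subjects changes, and under these assumptions that alone is enough to move the result from "not significant" to "significant".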

I’ve noticed publications favoring alternative medicine rarely contain information about effect size, even though once the data are in place, the calculations can be done in a matter of minutes.

One can calculate effect size in different ways (Here’s a quick calculator). The easiest by far simply compares the means and standard deviations of two groups. From that comparison, you get a single number whose magnitude corresponds to the size of the effect. It usually falls between 0 and 1 but can exceed 1, and its sign only tells you which group scored higher. From Cohen (1988):

For Cohen’s *d*, an effect size of 0.2 to 0.3 might be a “small” effect, around 0.5 a “medium” effect and 0.8 to infinity, a “large” effect.
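For the curious, here is a minimal sketch of that calculation in Python, using the pooled-standard-deviation form of Cohen's d. The function names `cohens_d` and `rough_label`, the hard cutoffs at 0.2, 0.5 and 0.8, and the depression scores in the example are all my own assumptions for illustration. Cohen offered his labels as loose conventions, not strict boundaries.

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

def rough_label(d):
    """Map |d| onto Cohen's conventional labels (cutoffs are a simplification)."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"

# Made-up depression scores (lower is better): treatment vs. control
d = cohens_d(mean1=14.0, sd1=5.0, n1=30, mean2=16.0, sd2=5.0, n2=30)
print(f"d = {d:.2f} ({rough_label(d)})")
```

With these invented numbers the treatment group scores 2 points lower against a pooled standard deviation of 5, so d comes out at -0.4: a "small" effect, with the minus sign only indicating direction.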

Hypothetically, let’s say we’re testing the null hypothesis that a 12-week regimen of homeopathic clover pills taken orally twice a day has no effect on depression rates among subjects.

We’ve got 3,000 patients in the treatment arm and 3,000 patients in the control arm. Our study is double blind and placebo controlled, because we rule like that.

After 12 weeks, we find that of the patients in the treatment arm, 1,000 report a reduction in depression symptoms, while of the patients in the placebo arm, 929 report a reduction in depression symptoms. These results would net a P-value of <0.05, meaning we got statistically significant results. However, the treatment arm response rate clocked in at 33.3%, while the placebo arm response rate clocked in at 31.0%, a difference of about 2.4 percentage points.
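If you want to check the arithmetic yourself, here is a minimal sketch in Python. I am assuming a plain two-proportion z-test with a pooled standard error and no continuity correction. The hypothetical study above does not specify a test, and a different choice (such as a continuity-corrected chi-square) could land on the other side of 0.05. The function name `two_proportion_z_test` is my own.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference in proportions (pooled SE, no correction)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return p1, p2, z, p_value

# Clover arm: 1000 of 3000 improved; placebo arm: 929 of 3000 improved
p1, p2, z, p_value = two_proportion_z_test(1000, 3000, 929, 3000)
print(f"treatment response rate: {p1:.1%}")
print(f"placebo response rate:   {p2:.1%}")
print(f"difference:              {(p1 - p2) * 100:.1f} percentage points")
print(f"z = {z:.2f}, two-sided p = {p_value:.4f}")
```

Under these assumptions the P-value squeaks in just under 0.05, which shows how little it takes to earn the "significant" label with 6,000 subjects.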

The P-value alone cannot show that these results are essentially clinically irrelevant. Because this test had very simple outcomes, we can use simple subtraction to arrive at the conclusion that clover has little clinical effect on depression. For more complex variables, we gauge clinical magnitude using effect size.
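Simple subtraction does the job here, but if you want a standardized effect size for two response rates, one common option is Cohen's h, which compares arcsine-transformed proportions and is usually read against the same rough 0.2 / 0.5 / 0.8 benchmarks. This is my own addition to the clover example, sketched in Python, and the function name `cohens_h` is just a label I chose.

```python
import math

def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

treatment_rate = 1000 / 3000  # clover arm
placebo_rate = 929 / 3000     # placebo arm

h = cohens_h(treatment_rate, placebo_rate)
print(f"Cohen's h = {h:.3f}")
print("below the conventional 'small' threshold of 0.2:", abs(h) < 0.2)
```

The result sits far below even the "small" benchmark, which matches what the subtraction already told us.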

I also must note that *if* we establish that for a given treatment and placebo group, researchers find both a significant difference *and* a large effect size, then we must propose a *testable mechanism* to explain this difference. Of course, there is the rub: CAM studies rarely contain testable mechanisms of action. How many times have you heard that an alternative medicine “baffles scientists” or “can’t be explained by science”? You can bet the effects can be explained by science. If a given phenomenon has a measurable effect on the universe, we can test it. If researchers cannot determine a testable mechanism, then their methodology is flawed or their effect entirely placebo.

If alt-med wants (personification, I know) to be considered alongside science, it has to play the science game and the results of studies need to be held to the same standards as science-based medicine. To do that, researchers must show that their particular treatment of interest has clinical relevance beyond mere statistical significance. Additionally, the methodology of studies must be sound and transparent, using such methods as blinding, placebo/control arms and discussion and calculation of effect size. Last, there must be a mechanism – a mechanism of action separates evidence-based medicine from science-based medicine.

Sadly, many people in the alt-med crowd trust “natural” or “herbal” remedies as safe or even superior, but decry commercial pharmaceuticals. They distrust commercial pharmaceuticals despite the fact that many of them are derived from isolated plant compounds that have undergone not just toxicity testing but repeated double-blind placebo-controlled studies to show their efficacy as well. They then tout poorly controlled studies reporting only statistical significance as proof of the efficacy of a particular treatment they deem superior and safer. I am sympathetic, however, because scientific publications are both difficult to interpret and difficult to acquire.

In my published research, I always include a measure of effect size. I do biomechanics research in the context of physical rehabilitation, and while we might have the ability to show a significant difference between therapies, effect size matters more because therapies tend to involve a lot of effort and/or time on the part of our clients/subjects. If I can gain a few degrees of shoulder flexion after weeks of intensive therapy, those few degrees, while statistically significant, might not mean much clinically.

**TL;DR – Effect size is a more important measure in published research than statistical significance alone. A statistically significant result may cause readers to believe the study reports a meaningful effect on biology, when in fact the study does not.**

Refs:

Citrome, L. The Tyranny of the P-value: Effect Size Matters. Bulletin of Clinical Psychopharmacology 2011;21(2):91-2. Online here.

Jacob Cohen (1988). *Statistical Power Analysis for the Behavioral Sciences* (second ed.). Lawrence Erlbaum Associates.

Also see Coe, R. *It’s the Effect Size, Stupid: What effect size is and why it is important* if you want more math.

*Learn more about Christina and follow her @ziztur.*