(Effect) Size Matters

Christina here…

Science. It works, bitches.

Today I shall science you about research literature and about how measures of effect size help us detect the magnitude of the effect of some researched treatment modality.

I’ll try to present this information for the lay reader interested in science and in learning how to read and interpret research literature. This might serve as a useful skill, especially if your interests include skepticism of alternative medicine.

The purpose of research, especially medical research, is to determine if a particular variable (modality, treatment, drug, therapy, etc) has a clinically meaningful (i.e. worthwhile and useful) effect. However, a common misconception among researchers and lay people alike is that if the outcome of a study is statistically significant, then one can glean importance or meaning from said research.

The problem: statistical significance is insufficient for determining clinical relevance (in simplest terms, whether or not it works). For that, we need another measure, namely effect size.

When people speak of research, “significance” gets frequently batted around as though the word means something like, “the quality of being important or great”.

Here’s a hypothetical research paper abstract result: “Clover significantly more effective than placebo at reducing depression.”

To a lay reader, the use of the term “significant” denotes a great difference, which is why findings like the one above often get translated as, “Clover cures depression!”

But tests of significance don’t tell us whether an effect is trivial or massive. They don’t tell us the magnitude of the effect. For that, we need to know the effect size. Unfortunately, the gold standard for worthwhile publication in many scientific journals is statistical significance, i.e. a P-value of less than 0.05. Citrome (2011) referred to this as the “Tyranny of the P”, as many journals, especially lower-impact ones, use this as their main measure of publication worthiness.

A lot of published research (especially of the alternative-medicine variety) will report statistically significant differences, but will not state the magnitude of those differences. Researchers may find a statistical difference between two treatment groups, but that difference might not reflect any clinically relevant effect that would warrant using the alternative medicine as a treatment.

Effect size is a measure of the strength of the relationship between two variables. In scientific experiments, in order to determine clinical relevance, we must know not only whether an experiment has a statistically significant effect, but also the size of any observed effects. In other words, if we look at a group of 3,000 people in a treatment group and 3,000 people in a placebo group, we might find significant differences, even when those differences are essentially meaningless or trivial. How different the groups are, and how large those differences are, carries much greater weight. Researchers can usually find a way to show a significant difference between two groups unless those groups are 100% identical. The important part is the degree to which the groups differ. For this, you need to know the effect size. To shamelessly quote the Wikipedia entry on effect size:

Presentation of effect size and confidence interval is highly recommended in biological journals. Biologists should ultimately be interested in biological importance, which can be assessed using the magnitude of an effect, not statistical significance. Combined use of an effect size and its confidence interval enables someone to assess the relationship within data more effectively than the use of p values, regardless of statistical significance. Also, routine presentation of effect size will encourage researchers to view their results in the context of previous research and will facilitate incorporating results into future meta-analysis. However, issues surrounding publication bias towards statistically significant results, coupled with inadequate statistical power will lead to an overestimation of effect sizes, consequently affecting meta-analyses and power-analyses.

I’ve noticed publications favoring alternative medicine rarely contain information about effect size, even though once the data are in place, the calculations can be done in a matter of minutes.

One can calculate effect size in different ways (here’s a quick calculator). The easiest by far simply compares the means and standard deviations of two groups. From that comparison, you get a standardized number, Cohen’s d, whose size corresponds to the magnitude of the effect. From Cohen (1988):

For Cohen’s d, an effect size of 0.2 to 0.3 might be a “small” effect, around 0.5 a “medium” effect, and 0.8 to infinity a “large” effect.
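Just to make that concrete, here’s a minimal sketch of the arithmetic in Python (the depression-score numbers below are invented purely for illustration, not taken from any real study):

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d: difference in group means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

# Made-up numbers: the treatment group improves 5 points on a depression scale,
# the placebo group improves 4 points, both with a standard deviation of 10.
d = cohens_d(mean1=5.0, sd1=10.0, n1=3000, mean2=4.0, sd2=10.0, n2=3000)
print(f"Cohen's d = {d:.2f}")  # 0.10 -- well below even the "small" threshold of 0.2
```

With 3,000 people per group, even that one-point difference would easily come out statistically significant, yet a d of 0.1 tells you the effect is trivial.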

Hypothetically, let’s say we’re testing the null hypothesis that a 12-week regimen of homeopathic clover pills taken orally twice a day has no effect on depression rates among subjects.

We’ve got 3,000 patients in the treatment arm and 3,000 patients in the control arm. Our study is double blind and placebo controlled, because we rule like that.

After 12 weeks, we find that 1,000 of the patients in the treatment arm report a reduction in depression symptoms, while 929 of the patients in the placebo arm report a reduction in depression symptoms. These results would net a P-value of just under 0.05, meaning we got statistically significant results. However, the treatment arm response rate clocked in at 33.3%, while the placebo arm response rate clocked in at 30.9% – a difference of just 2.4 percentage points.

The P-value alone could not show that these results are essentially clinically irrelevant. Because this test had very simple outcomes, we can use simple subtraction to arrive at the conclusion that clover has little clinical effect on depression. For more complex outcomes, we measure the clinical magnitude using effect size.
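For the curious, here’s a minimal sketch of that calculation in Python, using the hypothetical trial numbers above; the two-proportion z-test and Cohen’s h are simply two standard choices for comparing proportions, not the only ways to do it:

```python
import math

# Hypothetical trial numbers from the example above.
x1, n1 = 1000, 3000   # clover arm: responders / total
x2, n2 = 929, 3000    # placebo arm: responders / total
p1, p2 = x1 / n1, x2 / n2

# Two-proportion z-test with a pooled standard error.
p_pool = (x1 + x2) / (n1 + n2)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

# Two effect-size measures for a difference in proportions.
risk_difference = p1 - p2
cohens_h = 2 * (math.asin(math.sqrt(p1)) - math.asin(math.sqrt(p2)))

print(f"response rates: {p1:.1%} vs {p2:.1%}")
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")   # about 0.05: "statistically significant"
print(f"risk difference = {risk_difference:.1%}")    # about 2.4 percentage points
print(f"Cohen's h = {cohens_h:.2f}")                 # about 0.05: a trivial effect
```

The P-value squeaks under 0.05 while both effect-size measures say the difference is tiny, which is exactly the gap between statistical significance and clinical relevance.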

I must also note that if researchers find both a significant difference and a large effect size between a treatment group and a placebo group, then we must propose a testable mechanism to explain that difference. Of course, there’s the rub: CAM studies rarely contain testable mechanisms of action. How many times have you heard that an alternative medicine “baffles scientists” or “can’t be explained by science”? You can bet the effects can be explained by science. If a given phenomenon has a measurable effect on the universe, we can test it. If researchers cannot identify a testable mechanism, then their methodology is flawed or their effect is entirely placebo.

If alt-med wants (personification, I know) to be considered alongside science, it has to play the science game, and the results of its studies need to be held to the same standards as science-based medicine. To do that, researchers must show that their particular treatment of interest has clinical relevance beyond mere statistical significance. Additionally, the methodology of studies must be sound and transparent, using such methods as blinding, placebo/control arms, and discussion and calculation of effect size. Last, there must be a mechanism: a mechanism of action is what separates evidence-based medicine from science-based medicine.

Sadly, many people in the alt-med crowd trust “natural” or “herbal” remedies as safe or even superior, but decry commercial pharmaceuticals. They distrust commercial pharmaceuticals even though many of them are derived from isolated plant compounds, and all of them have undergone not just toxicity testing but repeated double-blind, placebo-controlled studies to show their efficacy as well. They then tout poorly controlled studies reporting only statistical significance as proof of the efficacy of a particular treatment they deem superior and safer. I am sympathetic, however, because scientific publications are both difficult to acquire and difficult to interpret.

In my published research, I always include a measure of effect size. I do biomechanics research in the context of physical rehabilitation, and while we might be able to show a statistically significant difference between therapies, effect size matters even more because therapies tend to demand a lot of effort and/or time from our clients/subjects. If I can gain a few degrees of shoulder flexion after weeks of intensive therapy, those few degrees, while statistically significant, might not mean much clinically.

TL;DR: Effect size is a more important measure in published research than statistical significance alone. A statistically significant result may lead readers to believe a study reports a meaningful biological effect when in fact it does not.

Refs:

Citrome L. The Tyranny of the P-value: Effect Size Matters. Bulletin of Clinical Psychopharmacology 2011;21(2):91-2. Online here.

Jacob Cohen (1988). Statistical Power Analysis for the Behavioral Sciences (second ed.). Lawrence Erlbaum Associates.

Also see Coe, R., It’s the Effect Size, Stupid: What effect size is and why it is important, if you want more math.

 

Learn more about Christina and follow her @ziztur.

  • Brad

    Very nice article, Christina.

    Popular headlines will always tend to be sensationalistic, but I think you are right, even in more technical articles, you hardly ever hear about effect size.

    So in your hypothetical clover study, what would the calculated effect size be?

    Are Cohen’s ranges (0.2 to 0.3 = small, 0.5 = medium, 0.8+ = large) still generally used as a guideline?

  • scthinks

    I’m not sure you could find a more amusing typo in the discussion of P-values than “The Tranny of the P-value.” Might want to fix your citation there.

  • slc1

    It is true that, in many branches of science, effect size is equally as important as statistical significance. However, this is not true in all areas of science. It can happen that even very small effects can have enormous significance. A couple of examples from celestial mechanics follow.

    1. In the famous Michelson/Morley experiment in the late 19th century, the observers were looking for an effect of 37/186000 = .02%. The failure to find even this very small effect had a profound influence on the history of physics in the 20th century (e.g. Special Relativity).

    2. Observations of the motion of the planet Mercury made in the 19th century discovered a discrepancy of 43 seconds of arc/century between the computed value and the observed value of the precession rate of the major axis of its orbit. This value, which is even smaller then the expected effect in the Michelson/Morley experiment, also had a profound effect on 20th century physics (e.g. General Relativity).

  • http://Www.ziztur.com Christina

    I agree about significance being important in other branches of science, sorry if that was not made clear.

    Also, tranny of the P – LMAO! Fixed.

  • DFL42

    Well said. Paul Ingraham at saveyourself.ca also writes well about this: http://saveyourself.ca/articles/statistical-significance.php

  • teh_faust

    Nice article. Thank you for clearing that up :)
    “Significance” sounds mighty important and I’ve come across misconceptions about what it means in terms of size or importance even among psych majors.

    I also think we should be accumulating and averaging effect sizes instead of pretending that my own study with its one significant result is the only one that matters. Try to get a complete picture instead of just selectively citing conflictive evidence.
    If you do enough studies with large enough samples to testing the effect of sugar lumps on your asthma, you’re bound to get at least some significant results. And sloppy methods and taking liberties (as cannot be controlled when people just leave out information of the reports) can push the error level well above the 5% that are deemed acceptable.

    I’d like to add that, even in medical or social sciences, smaller effect sizes can be of importance, too, because assuming that an effect is real (i.e. properly replicated and not just a false positive) then it may make an important difference in the lives of some people – even if it is only a small number.
    Given that many psychological treatments tend to have a substantial amount of non-responders, I think it would be useful to have a hierarchy of treatments (even those with smaller effects), because the non-responders don’t have to be the same across treatments, i.e. different patients may respond differently.
    Another important step would be to identify subgroups: I.e. Are there factors that predict what kind of treatment has the best chance of being effective?
    Which is another reason why clearing up mechanisms matters – it allows for better predictions.

    • http://www.facebook.com/ziztur Christina

      I agree with you here too. If my hypothetical experiment were testing clover as a cure for depression and the main outcome was suicide instead of the more general “reduction in symptoms” and we got the same numbers, we would probably remark that while the effect size is small, the net result saves lives, and a 2% reduction in suicide rates is of particular importance.

  • MurOllavan

    And on top of that statistical significance is often reported in a fashion that makes it seem just because there is a 5% probability results could be due to chance that means its a 95% chance they aren’t(irregardless of effect size).

    Is worst in court rooms w/DNA evidence. 1 in a million chance of being their DNA is != chance they did the crime. Or used as a fallacy say universe odds w/o god is 1 in a trillion doesn’t mean p(god) is 999,999,999,999-1 fav.

    I like your posts so far.

    • MurOllavan

      Meant to say “chance they didn’t do the crime”.

  • Aspect Sign

    Nice article, thanks.

    Pieces like this have great value it’s short clear and useful especially to those without a technical science education.

    Often for those who are inquisitive and capable of logic and reason differences like this become obvious, that doesn’t necessarily translate into being able to clearly communicate the significance. Someone taking the time to write a piece like this is invaluable.

    At a time like this when the skeptical community is seeing a lot of growth it is even more important. While many are coming to realize that their views should be based on evidence, that doesn’t automatically translate into having the skills to evaluate evidence.

    Those that have those skills, sharing them helps a lot, both directly and indirectly by giving others the clear explanations they might not have themselves.