Since I became a researcher in the late 1990s, I have been an advocate of using effect size and meta-analysis for summarising and combining results from research.
It seems hard to imagine now, but back then, both approaches were well outside the mainstream of educational research practice and were neither widely used nor understood by most education researchers, let alone policymakers or teachers.
Two events that helped to change that were the publication of John Hattie’s Visible Learning in 2009 and the launch of the Sutton Trust’s Teaching and Learning Toolkit in 2011 (later taken on by the Education Endowment Foundation). Today it is common for effect sizes to be talked about in blogs, training events and professional conversations among teachers, and at least some understanding of what they are seems to be much more widely shared.
Perhaps inevitably, as an idea becomes popular, it is sometimes misunderstood, over-simplified or even abused. As it becomes influential, people rightly worry that these misuses carry real and harmful consequences.
In his 2016 book Leadership for Teacher Learning, Dylan Wiliam devotes a substantial section to discussing the affordances and limitations of meta-analysis and the problems with effect sizes. He does acknowledge that “the standardized effect size is a huge improvement on previous ways of reporting the outcomes of educational experiments” (p. 82), by which he means the use of significance tests alone, but he then goes on to outline a number of specific threats or confounds that arise in interpreting effect sizes from research. These include: the problem of comparing interventions of different intensity or duration, or ones evaluated with measures of different sensitivity to instruction (whether the test captures what is actually taught, a particular problem for standardised tests); the ‘file drawer problem’ of studies with null results remaining unpublished; the fact that the effect size of a year’s typical growth varies dramatically with age; and the relevance and generalisability of existing studies to the questions we actually want to ask.
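For readers less familiar with the term, a standardised effect size of this kind is simply the difference between group means divided by a pooled standard deviation. The snippet below is a minimal sketch of that calculation with invented numbers; it is my own illustration, not an example from Wiliam’s book.

```python
import numpy as np

def cohens_d(treatment, control):
    """Standardised mean difference: (mean_t - mean_c) / pooled standard deviation."""
    t = np.asarray(treatment, dtype=float)
    c = np.asarray(control, dtype=float)
    pooled_var = ((len(t) - 1) * t.var(ddof=1) + (len(c) - 1) * c.var(ddof=1)) / (len(t) + len(c) - 2)
    return (t.mean() - c.mean()) / np.sqrt(pooled_var)

# Invented example: a 4-point raw gain on a test whose scores have an SD of about 10.
rng = np.random.default_rng(42)
control_scores = rng.normal(50, 10, size=200)
treatment_scores = rng.normal(54, 10, size=200)
print(f"d = {cohens_d(treatment_scores, control_scores):.2f}")  # roughly 0.4
```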
Wiliam concludes with an unequivocal warning: “right now meta-analysis is simply not a suitable technique for summarizing the relative effectiveness of different approaches to improving student learning, and any leader who relies on meta-analysis to determine priorities for the development of the teachers they lead is likely to spend a great deal of time helping their teachers improve aspects of practice that do not benefit students.”
A similarly hard-hitting critique of the use of effect sizes to guide policy and practice decisions comes in a 2017 paper by my Durham colleague Adrian Simpson. He argues that effect sizes are confounded with features of the experimental designs used and that, crucially, these methodological features are correlated with particular types of intervention. Hence these errors will not cancel each other out: the estimated ‘effectiveness’ of different types of intervention will be systematically inflated or deflated.
Simpson gives specific examples of the confounds that concern him, using both actual and simulated data to show that their impact can be substantial.
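To see how a design feature can change the effect size without the intervention working any better, consider range restriction: a study that recruits a narrower, more homogeneous sample has a smaller standard deviation in the denominator, so the same raw gain yields a larger effect size. The simulation below is a sketch of this general mechanism under invented assumptions; it is not a reproduction of Simpson’s own examples.

```python
import numpy as np

def effect_size(treatment, control):
    """Cohen's d with a simple pooled standard deviation."""
    pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
    return (treatment.mean() - control.mean()) / pooled_sd

rng = np.random.default_rng(1)
RAW_GAIN = 5                      # the intervention adds 5 points in both scenarios
population = rng.normal(100, 15, size=100_000)

# Scenario 1: a broad, representative sample; the 'treatment' scores are just the
# same scores shifted by the raw gain, to isolate the role of the denominator.
d_broad = effect_size(population + RAW_GAIN, population)

# Scenario 2: a restricted sample, e.g. only pupils within half a standard deviation
# of the mean, as when a trial recruits a deliberately homogeneous group.
restricted = population[np.abs(population - 100) < 0.5 * 15]
d_restricted = effect_size(restricted + RAW_GAIN, restricted)

print(f"broad sample:      d = {d_broad:.2f}")       # about 0.33
print(f"restricted sample: d = {d_restricted:.2f}")  # about 1.2, despite the identical raw gain
```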
A third attack comes from Bob Slavin, a researcher who has himself not only done many systematic reviews, but also contributed substantively to their methodology over many years. Slavin’s critique focusses specifically on John Hattie’s oversimplified league tables of effects.
Slavin’s critique is particularly powerful because of his impressive contributions to the field of evidence synthesis in education. He has not only developed his own approach that he calls ‘best-evidence synthesis’ (which combines features of meta-analytic and narrative reviews), and conducted multiple reviews using this approach, but also highlighted many of the problems that arise in interpreting effect sizes.
An important 2015 paper with Alan Cheung, for example, showed how features of a study such as whether it was a small-scale or a large trial, whether it used experimenter-designed or independent outcome measures, whether it was published or not, and whether its design was quasi-experimental or randomised can each double the effect size for the same intervention.
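For readers curious how such comparisons are made, the sketch below shows a generic fixed-effect subgroup pooling of invented effect sizes, of the kind a meta-analyst might use to compare, say, randomised with quasi-experimental studies; it is not Cheung and Slavin’s actual analysis, and every number in it is made up.

```python
# Generic fixed-effect subgroup comparison: pool effect sizes within each category
# of a design feature using inverse-variance weights, then compare the pooled values.
from dataclasses import dataclass

@dataclass
class Study:
    effect_size: float   # standardised mean difference reported by the study
    variance: float      # its sampling variance (smaller for larger studies)
    randomised: bool     # the design feature ("moderator") being examined

studies = [
    Study(0.45, 0.020, False), Study(0.38, 0.015, False), Study(0.52, 0.030, False),
    Study(0.18, 0.004, True),  Study(0.25, 0.006, True),  Study(0.21, 0.005, True),
]

def pooled(subset):
    """Inverse-variance weighted mean effect size for a set of studies."""
    weights = [1 / s.variance for s in subset]
    return sum(w * s.effect_size for w, s in zip(weights, subset)) / sum(weights)

quasi = pooled([s for s in studies if not s.randomised])
rand = pooled([s for s in studies if s.randomised])
print(f"quasi-experimental studies: {quasi:.2f}")
print(f"randomised studies:         {rand:.2f}")  # roughly half the size in this made-up example
```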
Inevitably, these criticisms have gained some traction, and people are questioning whether we should be a little more cautious about some of the things we thought we knew. For me, one of the interesting features of this debate is that teachers are at the heart of it. In the 1990s, I, along with many other researchers (including my fellow panel members), tried very hard to get teachers to engage with educational research; my recollection is that it was mostly quite frustrating and unsuccessful.
So one of the things I love about researchED and the other online communities of research-aware teachers is that these kinds of debates among researchers are now being picked up, taken on, added to and refocussed on the issues that really matter in practice.
A series of blogs by Greg Ashman and podcast interviews with both Hattie and Simpson by Ollie Lovell are great examples of this.
So what are we to make of all this criticism?
Should we abandon any attempt to compare or combine effect sizes from different types of study (as Simpson advocates)? Or are other approaches to systematic review (such as Slavin’s best-evidence synthesis, which is also advocated by Wiliam) safe enough? And where does this leave the meta-meta-analysis that tries to give an overview of the relative impact of different interventions, the kind of choice that a real teacher or school leader may face?
Read Part 2: What should we do about meta-analysis and effect size?