Saturday, October 1, 2022

Understanding Meta-Analyses


Gene V Glass

DECADES AGO, say prior to the explosion of research on education in the 1960s, summarizing the research on a topic was relatively simple. There might be five relevant studies to consider, and a narrative review was possible. "Jones (1945) found superior reading achievement at grade three for students initially taught by the Whole Language method; however, Lopez (1938) found no such differences for a group of 125 students whose home language was Spanish." However, as early as the 1970s, the number of empirical research studies on many topics of interest to educators and policy makers was multiplying wildly. By the late 1970s, a hundred studies of the relationship between school class size and achievement confronted anyone wishing to review that literature. The narrative review no longer sufficed. The typical narrative review of the 1970s—as would have been reported in the Review of Educational Research, for example—became a long list of "positive" and "negative" findings followed by a call for additional research. Systematic methods of coding, organizing, and displaying information were needed; traditional statistical analysis methods, applied in the form of meta-analysis, proved equal to the task. Although meta-analysis was originally developed for summarizing research in education and psychology (Glass, 1976), today it is applied much more frequently in medicine.

WHAT IS META-ANALYSIS?

Meta-analysis is a collection of statistical analysis techniques used to summarize a body of empirical research studies on a particular topic, for example, Is Whole Language a better method of initial reading instruction than Phonics? Or, do students who are retained in a grade for an additional year benefit subsequently more than those who are not? Its findings are expressed in quantitative terms, for example, “Aggregating the findings of eighty-five experiments on the effects of Ritalin on classroom behavior, we find that there are on average 22 percent fewer discipline problems reported in a year’s time for the children placed on Ritalin than in a comparable group on a placebo or receiving no treatment at all for Attention-Deficit Disorder.”

Meta-Analysis Illustrated

An illustration of a simple meta-analysis will clarify how these techniques are applied. Suppose that we are interested in what the existing research studies say about the effectiveness of Whole Language versus Phonics as methods of initial reading instruction. From the very first step, a controversial decision must be made in conducting a literature search: Which studies, and how many, should be included? This is in part a question about how one should go about searching for the relevant literature, and in part about what type of studies should be included. On the latter question, experts can disagree. Some maintain that only the "best" studies should be included (Slavin, 1986); others advise including every relevant study, whether it is "good," "bad," or "indifferent" (Glass, 1978). As to how relevant studies should be located, the life of the meta-analyst has never been better. The internet has greatly facilitated the collecting of research; Google Scholar (http://scholar.google.com) and the online ERIC (www.eric.ed.gov/) are invaluable resources for locating studies. The old-fashioned technique of branching bibliographies is still an important means of collecting research for a meta-analysis: find a very recent study on the topic, inspect the list of references cited in that study, go to those studies and do likewise; in this way, one will quickly compile a list of most of the relevant research. The only glitch is that studies newer than the "recent study" one began with will be missed; but "search forward" resources are available in which one can start with an older study and find all later studies that reference it (see the ISI Web of Knowledge, http://scientific.thomson.com/isi/).

Assume that our meta-analyst studying Whole Language (WL) versus Phonics (PH) has collected most of the relevant literature and is prepared to take the next steps. Suppose that two hundred experimental studies comparing the two methods of teaching reading have been located. Two of these many studies are “Jones (1945)” and “Lopez (1938).” In order to keep straight all of the information contained in these many studies, the meta-analyst has devised a database structure for coding the studies. Coding involves defining key features of the studies (for example, when the study was published, the ages of the children, the length of the instruction, and the like) and describing these features by means of numerical, alphabetical, or word codes. The two studies by Jones and Lopez are coded in Table 1.
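A hypothetical sketch (in Python here) of how such a coded record might look follows; the field names are invented for illustration, and only the summary statistics quoted later in the text are filled in. Glass's actual coding scheme would contain many more variables.

    # Hypothetical sketch of coded study records; field names are invented,
    # and only the summary statistics quoted in the text are filled in.
    studies = [
        {
            "study": "Jones (1945)",
            "year": 1945,
            "grade": 3,
            "reading_mean_wl": 4.4, "reading_sd_wl": 1.1,  # WL group, GE units
            "reading_mean_ph": 3.8, "reading_sd_ph": 1.0,  # PH group, GE units
        },
        {
            "study": "Lopez (1938)",
            "year": 1938,
            "home_language": "Spanish",
            "reading_mean_wl": 3.1, "reading_sd_wl": 0.9,
            "reading_mean_ph": 3.0, "reading_sd_ph": 1.1,
        },
    ]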

In reality, many more characteristics of studies would be coded than the few illustrated in Table 1.

For comparative experimental studies like these, the focus of attention is on the outcome measures, such as the achievement or attitude results. Jones administered the California Achievement Test and found a six-month grade equivalent (GE) superiority for the WL group when compared to the PH group's test results. Lopez administered the Iowa Test of Basic Skills and found a one-month GE superiority for the students initially taught to read by the WL method. There are dozens of ways to measure reading achievement of elementary grade students. If the meta-analyst is required to keep all of the findings separate depending on which test was administered, the general trends in the two hundred studies will be lost in the confusion of dozens of individual results. In truth, the standardized tests of reading achievement differ only in insignificant particulars; they test vocabulary and comprehension, and that's about it. The meta-analyst calculates for each study a general measure of the relative superiority of WL versus PH known as the effect size (ES). The effect size is like a standard score in that it expresses the comparison of WL and PH on a scale that is independent of the mean and standard deviation (SD) of the original scales used to measure outcomes. For example, suppose that on the Scholastic Aptitude Test (SAT) verbal test women score 550 on average and men score 500. The SAT has a standard deviation of 100 points; therefore, women exceed men by one-half standard deviation; that is, the ES comparing women and men is 0.50, favoring women. Now suppose that on the American College Test (ACT) Reading test, women score 21.5 on average and men score 19.0. The ACT Reading test has a standard deviation of 5 points, so the ES comparing women and men on the ACT Reading test is also 0.50. We have compared women and men on verbal tests even though those tests use quite different measurement scales.

In the current example of WL and PH, the Jones study gives an ES as follows:

ES = (Mean-WL − Mean-PH) / [(SD-WL + SD-PH)/2]

ES = (4.4 − 3.8) / [(1.1 + 1.0) / 2] = 0.6 / 1.05 = 0.57

The two standard deviations are averaged to estimate the standard deviation of a group of students taught by either method. The effect size is interpreted as follows: the mean of the WL group is approximately six-tenths of a standard deviation higher on reading achievement than the mean of the PH group. The significance of the ES becomes clearer when it is displayed graphically as in Figure 1.
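A minimal computational sketch of this formula, assuming Python; the function simply divides the mean difference by the average of the two groups' standard deviations, as defined above, and reproduces the Jones result:

    # Effect size with the two groups' SDs averaged, as defined in the text.
    def effect_size(mean_wl, mean_ph, sd_wl, sd_ph):
        return (mean_wl - mean_ph) / ((sd_wl + sd_ph) / 2)

    print(round(effect_size(4.4, 3.8, 1.1, 1.0), 2))  # 0.57, the Jones study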

Figure 1 depicts the results of the Jones study on a scale of reading achievement that can be compared from one study to the next. The two curves in Figure 1 are separated at the mean by 0.57 standard deviations. One important implication of this effect size is that, assuming normal distributions of reading achievement scores, the average student taught by the WL method scores at about the 72nd percentile of the distribution of scores of students taught by the PH method. So a student at the center of the reading achievement distribution benefits by 22 percentile ranks (50th versus 72nd) when taught initial reading by the WL method.

For the students whose home language was Spanish in the Lopez study, the ES comparing WL and PH is:

ES = (3.1 − 3.0) / [(0.9 + 1.1) / 2] = 0.1 / 1.0 = 0.10

When two normal distributions of scores differ at the mean by .10 standard deviations, the average student (at the 50th percentile) in the higher distribution exceeds 54 percent of the students in the lower distribution, an advantage of only 4 percentile ranks.
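Assuming normal distributions, this translation from an effect size to a percentile advantage is just the normal cumulative distribution function; a short sketch using Python's standard library reproduces both figures:

    # Percent of the lower (PH) distribution that the average member of the
    # higher (WL) distribution exceeds, assuming normal score distributions.
    from statistics import NormalDist

    def percentile_of_average(es):
        return 100 * NormalDist().cdf(es)

    print(round(percentile_of_average(0.57)))  # ~72, the Jones study
    print(round(percentile_of_average(0.10)))  # ~54, the Lopez study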

A typical meta-analysis might involve the calculation of hundreds of effect sizes. Obviously, such a huge collection of data cannot be comprehended without the aid of further statistical analysis. This is where meta-analysis makes its entrance as the analysis of analyses. We suppose that “Jones (1945)” and “Lopez (1938)” are just two of two hundred comparative experiments on WL and PH. When the ES measures from all two hundred studies are calculated and averaged, the meta-analysis has begun. Suppose that across all two hundred studies the average ES is equal to 0.45 favoring WL over PH. This finding can be depicted as in Figure 2.

Surely among these two hundred effect size measures there is variability that can illuminate the choice between WL and PH. Whole Language teaching might be superior for one type of student but Phonics could be superior for another type. Or, perhaps, inexperienced beginning teachers have more luck with Phonics instruction than with Whole Language. The meta-analyst will investigate these questions by categorizing the data according to the coding conventions of the database and then reporting the average ES for various cross-tabulations, as in Table 2.

Here, the cross-tabulation of the effect sizes paid big dividends. WL was seen to be slightly, though insignificantly, inferior to PH for inexperienced teachers (in the first or second year of teaching), and insignificantly superior to PH for moderately experienced teachers, but extraordinarily superior to PH for teachers with six or more years of experience. In fact, an ES of 2.0 implies that the average student taught by WL exceeds in reading achievement more than 97 percent of the students taught by PH. (Of course, this illustration is hypothetical and exaggerated to make the point clearer.)
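A sketch of how such a cross-tabulation might be computed once the effect sizes have been coded, assuming the pandas library; the effect sizes below are invented to mirror the hypothetical pattern just described, not real study results:

    import pandas as pd

    # Invented effect sizes coded by teacher experience (cf. Table 2).
    df = pd.DataFrame({
        "teacher_experience": ["1-2 years", "1-2 years", "3-5 years",
                               "3-5 years", "6+ years", "6+ years"],
        "effect_size": [-0.10, -0.05, 0.15, 0.20, 1.90, 2.10],
    })

    # Average effect size within each experience category.
    print(df.groupby("teacher_experience")["effect_size"].agg(["mean", "count"]))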

At the same time, the subdividing of the database by "teacher experience" revealed the value of greater specificity in the questions being asked and the inadvisability of lumping together data arising from dissimilar circumstances whose differences ought not to be ignored. The meta-analyst never knows in advance which distinctions will prove to be important or whether the corpus of studies being analyzed will permit making the important distinctions. Achievement is just one of the outcomes that would be measured in an experiment comparing WL and PH. Attitude, or how much the children enjoy and seek out reading on their own, would likely also be evaluated in many studies. Jones, for example, administered the Moore Reading Interest Scale to the 138 students in her study. The Moore Scale consists of twenty items, for example, "I like to go to the library," each of which the student rates on a 5-point scale from 1 (Disagree) to 5 (Agree). A maximum score of 100 is possible though rarely obtained. The summary statistics for the two groups, WL and PH, in the Jones study appear in Table 1. An effect size can be calculated for these data in the same manner that it was calculated for the California Achievement Test data:

ES = (Mean-WL − Mean-PH) / [(SD-WL + SD-PH) / 2]

ES = (76 − 65) / [(12.6 + 11.9) / 2] = 11 / 12.25 = 0.90

The Lopez study also measured children’s interest in reading at the end of the experimental period. The Hopkins Reading Inventory is an interest inventory consisting of a list of fifteen activities: “I ask my parent to take me to the library,” “I want to get a book for my birthday,” and the like. A child’s score is the number of activities he or she reports engaging in. Scores can range from 0 to 15. The summary statistics for the Lopez study appear in Table 1. The effect size on the Hopkins Inventory for the seventy-four students in the Lopez study is as follows:

ES = (Mean-WL − Mean-PH) / [(SD-WL + SD-PH) / 2]

ES = (9.4 − 9.2) / [(3.2 + 3.4) / 2] = 0.2 / 3.3 = 0.06

The superiority of WL over PH on children's attitudes toward reading is very evident in the Jones study (ES = 0.90), but far less so in the Lopez study. In fact, one could conclude that there is no real difference in reading interest at all in the Lopez study. The meta-analyst will begin to search for explanations. As the WL method was much less effective relative to PH for the children whose home language was Spanish, is this also the reason why no important difference in attitude toward reading was seen in the Lopez study?

At this point, the critic of meta-analysis might say that things have gone too far. Lumping together results (ESs) from the California Achievement Test (CAT) and the Iowa Test of Basic Skills (ITBS) might be acceptable, but comparing the results on two different reading attitude inventories is comparing apples and oranges. But the two achievement tests were also "different." In fact, the CAT and the ITBS have separate forms that pose slightly different questions ("old . . . young; same or opposite?" or "weak . . . strong; same or opposite?"). Any meta-analysis involves the comparison of "different studies." Only "different studies" can be compared, for if two studies were "the same," there would be no point in comparing them; they would show the same results. (The "apples and oranges" criticism is addressed in more detail below.)

Comparative experimental studies lend themselves to description of their results with the effect size measure. Other kinds of studies are appropriately described by different measures. For example, studies of the relationship between parents' level of education and their children's academic achievement might be best described by a correlation coefficient (see, for example, White, 1982). A body of studies might show an average correlation of 0.35 between mother's level of education and child's achievement, but an average correlation of 0.20 between father's level of education and child's achievement.
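When correlations are the measure of effect, the aggregation step is analogous. One common convention (not necessarily the one used in White's review) is to average the per-study correlations through Fisher's z-transformation; a brief sketch with invented values:

    import math

    # Average correlations via Fisher's z-transform (one common convention);
    # the per-study values below are invented for illustration.
    def average_correlation(rs):
        zs = [math.atanh(r) for r in rs]
        return math.tanh(sum(zs) / len(zs))

    print(round(average_correlation([0.30, 0.35, 0.40]), 2))  # about 0.35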

EVALUATING META-ANALYSES

You may never perform a meta-analysis yourself, but the chances are good that in your role as a professional practitioner or even as an individual who faces important choices about your health or the education of your children, your relatives, or a friend, you will encounter a published meta-analysis. You will then want to know if you should believe what it says. The following list of questions, with some suggestions about what to look for in their answers, should help you in evaluating the claims emanating from a particular meta-analysis.

How Good Was the "Literature Search" That Drew Together the Collection of Studies to Be Analyzed? There are several characteristics of how the studies to be analyzed were collected that will bear on the strength of the results. Did the literature search catch the most recent work? Any well-performed meta-analysis will provide the reader with a list of the studies included in the analysis. It's an easy matter to check the dates of the studies to make sure that current work has been included. On the other end of the date spectrum, one ought to ask whether arbitrary decisions by the meta-analyst about when to cut off the search, such as "nothing before 1970 will be included," might have excluded important work.

Did the meta-analysis miss important studies merely because the analyst was lazy or underfunded? For example, for certain topics, the dissertation literature is rich and significant, but often analysts ignore it, either because it is thought to be of “low quality” or because it is too costly to access.

Did the meta-analyst exclude important work on arbitrary grounds that may have biased the results? Meta-analysts tend to be statistics experts or methodologists. As such, they often have strong opinions about what research is "good" and what research is "bad," even when those opinions are entirely a priori, that is, held even though the "good" and the "bad" studies show essentially the same thing. Any meta-analyst must draw boundaries around the literature and exclude some work for practical reasons. Drawing these boundaries is a matter of judgment. Beware when the meta-analyst reports that only the "best evidence" was included in the analysis. Too often, that "best evidence" is the work of the analyst, the analyst's students, or like-minded friends (Slavin, 1986).

Does the Meta-Analysis Compare Apples and Oranges? The single most frequent criticism of meta-analysis is that it compares “apples and oranges.” It is also the single most wrongheaded criticism. Nothing will prepare the reader for understanding meta-analyses more than to get one’s thoughts straight on what it means to compare apples and oranges.

Of course meta-analysis mixes apples and oranges; in the study of fruit nothing else is sensible; comparing apples and oranges is the only endeavor worthy of serious inquiry; comparing apples to apples is trivial.

The unthinking critic will claim that it only makes sense to integrate any two studies if they are studies of “the same thing.” But these critics who argue that no two studies should be compared unless they were studies of the “same thing” blithely compare persons (such as experimental “subjects”) within their studies all the time. This is inconsistent. It is self-contradictory to assert that “No two things can be compared unless they are the same.” If they are the same, there is no reason to compare them; indeed, if “they” are the same, then there are not two things, there is only one thing and comparison is not an issue. One study is an apple, and a second study is an orange. I compare fruits when I’m hungry and have to decide between an apple and an orange.

Any two studies differ in an infinite number of ways, even when they are on the same topic. Study 1 compares students’ achievement in class sizes of thirty-five and twenty; the subject taught is beginning algebra. Study 2 compares class sizes of twenty and fifteen; the subject taught is beginning geometry. Can their results be compared? Yes, of course, for some purposes. They can even be compared with Study 3 that evaluates achievement in class sizes of forty and twenty in American history. If all three studies show superior achievement for the smaller class size, we learn something. For the reader who is interested only in individualization of mathematics instruction, Study 3 might be the orange that will have to be separated from the apples, Studies 1 and 2. Different readers will have different opinions about how much the studies in a meta-analysis can differ before the findings cease to be relevant.

If a critic wishes to dismiss a particular meta-analysis from anyone’s consideration because it compares apples and oranges, it might be well to ask whether the critic simply doesn’t like the findings of the analysis.

Does the Meta-Analysis Ignore Important Differences Among Studies? This question gets at the other side of the coin of the apples and oranges problem. Just as it is wrongheaded to insist that any two studies that differ cannot be integrated, it is ill advised to ignore important differences when integrating multiple studies. Suppose that a meta-analysis is performed on a hundred experimental studies comparing distance learning with traditional large-group lecture instruction for college-age undergraduates. And suppose further that these one hundred studies yielded an average ES of 0.10—not impressive, perhaps, but not zero. In fact, the finding might be regarded as so unimpressive that readers of the meta-analysis decide that the added cost and trouble of mounting a distance education program just isn't worth it. This conclusion might have been different if the meta-analyst had reported the average effect sizes separately for different types of subject taught, as in Table 3, for example.

A meta-analysis that failed to take into account the important distinction of what subject is being taught via distance learning would surely leave its readers with a misleading if not false conclusion. Mathematics teaching by means of distance instruction works quite well; in fact, the average student studying math by distance teaching outperforms 74 percent of the students studying math in traditional large-group college instruction. But distance education doesn’t do so well in teaching composition, perhaps because the interaction between the instructor and the writer is curtailed, or in chemistry and biology classes, perhaps because there is just no adequate “online” substitute for laboratory work.

A good meta-analysis will make every attempt to examine and compare the results of the studies after they have been grouped according to important distinctions that may bear on the strength of the results. Often, the original studies being integrated in a meta-analysis do not do a good job of reporting the particular circumstances under which the study was performed, for example, distance learning was compared with large-group instruction, but the author of the study failed to record what subject was taught. There’s not much that the meta-analyst can do about this situation, but that does not necessarily mean that the study should be ignored or discarded.

Another Example

Consider an example that may help illuminate these matters. Perhaps the most controversial conclusion from the psychotherapy meta-analysis that my colleagues and I published in 1980 was that there was no evidence favoring behavioral psychotherapies over nonbehavioral psychotherapies. This finding was vilified by the behavioral therapy camp and praised by the Rogerians and Freudians. Some years later, I returned to the database and dug a little deeper. What I found appears in Figure 3. When one takes the nine experiments extant in 1979 in which behavioral and nonbehavioral psychotherapies were compared in the same experiment with randomized groups, and plots the effects of treatment as a function of follow-up time, the two curves in Figure 3 result. The findings are quite extraordinary and suggestive. Behavioral therapies produce large short-term effects which decay in strength over the first year of follow-up; nonbehavioral therapies produce initially smaller effects which increase over time. The two curves appear to be converging on the same long-term effect. I leave it to the reader to imagine why. One answer, I suspect, is not arcane and is quite plausible.

Figure 3, I believe, is a truer reflection of reality and how research, even meta-analysis, can lead us to more sophisticated understandings. Indeed, the world encompasses all manner of interesting differences and distinctions, and in general, gross averages do not do it justice. However, denying the importance of an average effect size that one does not like for personal reasons simply because the data are not broken down by one’s favorite distinction is not playing the game fairly. Not every distinction makes a real difference, as for example, when a graphologist (handwriting expert) claims that a pile of negative findings really hides great successes for graphologists trained by vegetarians.

Are the Findings of the Meta-Analysis Generalizable? Surely a fact that doesn't generalize is of no use to anyone. But it does not follow that the methods of inferential statistics (significance tests, hypothesis tests, confidence intervals) are the best means of reaching a general conclusion. The appropriate role for inferential statistics in meta-analysis is not merely unclear; it has been seen in quite disparate ways by different methodologists since meta-analysis first appeared.

Inferences to populations of persons seem quite unnecessary, because even a meta-analysis of modest size will involve a few hundred persons (nested within studies) and lead to nearly automatic rejection of null hypotheses. Moreover, the chances are remote that the persons or subjects within studies were drawn from defined populations with anything even remotely resembling probabilistic techniques. Hence, probabilistic calculations advanced as if subjects had been randomly selected would be dubious. At the level of "studies," the question of the appropriateness of inferential statistics can be posed again, and the answer again seems to be negative. There are two instances in which common inferential methods are clearly appropriate, not just in meta-analysis but in any research: (1) when a well-defined population has been randomly sampled, and (2) when subjects have been randomly assigned to conditions in a controlled experiment. The latter case is of little interest to meta-analysts, who never assign units to treatments. Moreover, the typical meta-analysis virtually never meets the condition of probabilistic sampling of a population—though in the case of Smith, Glass, and Miller (1980), the available population of psychoactive drug treatment experiments was so large that a random sample of experiments was in fact drawn for the meta-analysis. Inferential statistics has little role to play in meta-analysis.

It is common to acknowledge, in meta-analysis and elsewhere, that many data sets fail to meet probabilistic sampling conditions, and then argue that one ought to treat the data in hand “as if” it were a random sample of some “hypothetical population.” One must be wary here of the slide from “hypothesis about a population” into “a hypothetical population.” They are quite different things, the former being standard and unobjectionable, the latter being a figment that we hardly know how to handle.

The notion of a hypothetical population appears to be circular. If the sample is fixed and the population is allowed to be hypothetical, then surely the data analyst will imagine a population that resembles the sample of data. If I show you a handful of red and green M&Ms, you will naturally assume that I have just drawn my hand out of a bowl of mostly red and green M&Ms, not red and green and brown and yellow ones. Hence, all of these “hypothetical populations” will be merely reflections of the samples in hand and there will be no need for inferential statistics. Or put another way, if the population of inference is not defined by considerations separate from the characterization of the sample, then the population is merely a large version of the sample. With what confidence is one able to generalize the character of this sample to a population that looks like a big version of the sample? Well, with a great deal of confidence, obviously. But then, the population is nothing but the sample writ large, and we really know nothing more than what the sample tells us, in spite of the fact that we have attached misleadingly precise probability numbers to the result.

Hedges and Olkin (1985) have developed inferential techniques that ignore the pro forma testing (because of large N) of null hypotheses and focus on the estimation of regression functions that estimate effects at different levels of study. They worry about both sources of statistical instability: that arising from persons within studies and that which arises from variation between studies. The techniques they present are based on traditional assumptions of random sampling and independence. It is, of course, unclear to me precisely how the validity of their methods is compromised by failure to achieve probabilistic sampling of persons and studies.
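To make the general idea concrete, the sketch below runs an inverse-variance weighted regression of effect sizes on a single study characteristic. It is illustrative only: the numbers are invented, and the calculation shows the flavor of meta-regression rather than Hedges and Olkin's exact estimators.

    import numpy as np

    # Illustrative sketch: weighted least-squares regression of effect sizes
    # on a study-level characteristic (hours of instruction), with weights
    # inversely proportional to each effect size's sampling variance.
    # All numbers are invented; this is not Hedges and Olkin's exact method.
    es        = np.array([0.10, 0.30, 0.45, 0.60])   # effect sizes
    hours     = np.array([20.0, 40.0, 60.0, 80.0])   # study characteristic
    variances = np.array([0.04, 0.03, 0.05, 0.02])   # sampling variances

    X = np.column_stack([np.ones_like(hours), hours])
    W = np.diag(1.0 / variances)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ es)
    print(beta)  # intercept and slope of effect size on hours of instruction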

Does the Meta-Analysis Test a Theory? The answer may surprise you. Meta-analysis has nothing to do with theories. In spite of its superficial similarity to things that look very scientific, such as statistical formulas, meta-analyses are not tests of theories. Rather, meta-analysis is useful in the practical evaluation of techniques and methods. Do students learn more in small classes than in large? Do students who are made to repeat an elementary grade do better or worse throughout their subsequent years in school? Such evaluations are mostly, though not exclusively, matters of cost-effectiveness estimation. Meta-analysis can be helpful in assessing the effectiveness side of the equation. When joined with cost analysis, meta-analysis can make its best contribution to practical affairs. (See Levin, Glass, & Meister, 1987, for an example.)

SUMMARY

Prior to about 1975, the most common mode of reviewing a collection of studies on a topic was the narrative review. That method tended to break down as the literature grew to dozens and in some cases hundreds of studies. Methods of describing (“coding”) and statistically analyzing the results of multiple empirical studies were developed and now are classified under the name “meta-analysis.”

A typical meta-analysis might result in a conclusion such as, “Twenty studies comparing guided discovery versus the lecture method of teaching addition of fractions show that on average the typical student learning by the guided discovery method outperforms 70 percent of the students learning by the lecture method.” The validity of a meta-analysis rests on such considerations as the thoroughness of the literature search, the care exercised in coding studies, and proper methods of statistical analysis. Generalizing the findings of a meta-analysis is a topic on which even the experts disagree.

Suggested Readings

Waxman, H. C., Lin, M-F., & Michko, G. M. (2003). A meta-analysis of the effectiveness of teaching and learning with technology on student outcomes. Naperville, IL: Learning Point Associates. Retrieved November 2, 2007, from www.ncrel.org/tech/effects2/. A meta-analysis of teaching and learning aided by educational technologies authored by Hersh Waxman and his colleagues is an excellent illustration of the technique and its practical interpretation.

Glass, G. V (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5, 3–8.

Glass, G. V (1978). Integrating findings: The meta-analysis of research. Review of Research in Education, 5, 351–379.

Glass, G. V, McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: SAGE Publications.

Glass, G. V, Cahen, L. S., Filby, N. N., & Smith, M. L. (1982). School class size: Research and policy. Beverly Hills, CA: SAGE Publications.

Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. New York: Academic Press. The quantitative methods involved in meta-analysis received their definitive treatment in 1985 by Larry Hedges and Ingram Olkin.

Cooper, H. M., & Hedges, L. V. (Eds.). (1994). The handbook of research synthesis. New York: Russell Sage Foundation.

Hunt, M. (1997). How science takes stock: The story of meta-analysis. New York: Russell Sage Foundation. The science writer Morton Hunt was commissioned by the Russell Sage Foundation to write a history of meta-analysis. His book makes interesting reading for a general audience.
