1993
Glass, G. V & Martinez, B. A. (1993, June 3). Politics of teacher evaluation. Proceedings of the CREATE Cross-Cutting Evaluation Theory Planning Seminar (ED364581, pp. 121–134). ERIC. https://files.eric.ed.gov/fulltext/ED364581.pdf
Writings of Some General Interest, Not Readily Available Elsewhere. To receive a printable copy of an article, please email gvglass @ gmail.com.
Reviewed by Gene V Glass
Arizona State University
June 19, 1995
The Handbook of research synthesis is the third volume of a coordinated publication program on meta-analysis sponsored by the Russell Sage Foundation. Starting in 1987 under the direction of a Research Synthesis Committee (Harris Cooper, Thomas Cook, David Cordray, Heidi Hartmann, Larry Hedges, Richard Light, Thomas Louis and Frederick Mosteller), the project has previously produced The future of meta-analysis (Wachter and Straf, 1990) and Meta-analysis for explanation (Cook et al., 1992). The Handbook is by far the largest and most comprehensive publication of this project. It means to be the "definitive vade mecum for behavioral and medical scientists intent on applying the synthesis craft" (p. 7). At nearly 600 pages and three pounds, it will force researchers to leave their laptops behind.
Although the editors and many of the chapter authors eschew the term "meta-analysis" in favor of the broader "research synthesis," potential readers should understand that the former (statistical analysis of summary statistics from published reports) is the subject of the Handbook and not the more general concerns of theory commensurability or the planning of coordinated investigations suggested by the latter.
The organization of the Handbook follows the common logic of producing a meta-analysis: formulate the question, search the literature, code the information, analyze it, write a report. Some of the chapters are unremarkable, since much of the craft of doing research is routine; this only speaks to the completeness of the work. Chapter 6, "Research Registers" by Kay Dickersin, points to new possibilities. Medicine has databases of prospective, on-going and completed studies; Dickersin identifies 26 of them. Expand them slightly to include the actual data from clinical trials and other forms of study and many of the more vexing problems of meta-analysis (which arise from the telescoping of primary data into summary statistics--and the discarding of the former) will be solved. It is past time for behavioral research, both on-going and completed, to be catalogued and archived. Telecommunications has driven the costs of information storage and retrieval to near zero. Who will create the Internet Behavioral Research Archives?
Two themes imparted by the editors and the committee, one presumes, give the Handbook of research synthesis its distinctive character. Chapter 1 by the editors, Harris Cooper and Larry Hedges, is entitled "Research Synthesis as a Scientific Enterprise." Research synthesis is likened to doing science itself: both are seen as involving problem formulation, data collection, data evaluation, analysis and publication. These stages in both the pursuit of science and the conduct of research synthesis give the Handbook its section titles, and perhaps its entire bent. Although these stages might reasonably describe the stages in carrying out a meta-analysis, they do not capture what is distinctive about science. The stages describe equally well how one may conduct the evaluation of a device, a drug, a program or what-have-you. In effect, the Handbook draws no clear or convincing line between the pursuit of scientific theory and the evaluation of technology. This line is quite important and must be drawn.
To cast meta-analysis as dedicated to the construction of science disposes the discussion of it in the direction of classical statistical methods that evolved alongside quantitative science in the 20th century. In particular, the methods of statistical hypothesis testing have come to be associated with the scientific enterprise. The unwholesome effects of this association are the subject of a brilliant article by Paul Meehl (1990) on the progress of "soft psychology"; see particularly the Appendix where Meehl briefly addresses meta-analysis. Just as scientists bring forth hypotheses to be accepted or rejected by data, so do statisticians devise the strategies by which data are judged to be in accord with or at odds with the hypotheses. This view of statistics gives the Handbook its other defining theme: meta-analyses involve the testing of statistical hypotheses about parameters in populations of research studies.
The appropriate role for inferential statistics in meta-analysis is not merely unclear, it is seen quite differently by different methodologists. These differences are not reflected in the Handbook. In 1981, in the first extended discussion of the topic, McGaw, Smith and I raised doubts about the applicability of inferential statistics in meta-analysis. Inference at the level of persons within studies (of the type addressed by Becker in Chapter 15, "Combining Significance Levels") seemed quite unnecessary to us, since even a modest-sized synthesis will involve a few hundred persons (nested within studies) and lead to nearly automatic rejection of null hypotheses. Moreover, the chances are remote that these persons or subjects within studies were drawn from defined populations with anything approaching probabilistic techniques; hence, probabilistic calculations advanced as if subjects had been randomly selected are dubious. At the level of "studies," the question of the appropriateness of inferential statistics can be asked again, and the answer again seems to be negative. There are two instances in which common inferential methods are clearly appropriate: when a defined population has been randomly sampled and when subjects have been randomly assigned to conditions in a controlled experiment. In the latter case, Fisher showed how the permutation test can be used to make inferences to the universe of all possible permutations. But this case is of little interest to meta-analysts, who never assign units to treatments. The typical meta-analysis virtually never meets the condition of probabilistic sampling of a population (though in one instance (Smith, Glass & Miller, 1980), the available population of drug treatment experiments was so large that it was in fact randomly sampled for the meta-analysis).
Inferential statistics has little role to play in meta-analysis: "The probability conclusions of inferential statistics depend on something like probabilistic sampling, or else they make no sense." (p. 199)
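The near-automatic rejection described above is easy to see in a small numerical sketch. The code below uses Stouffer's sum-of-z method, one common way of combining significance levels; it is chosen here only for illustration and is not claimed to be the procedure of Chapter 15. The ten z scores are hypothetical.

```python
import math

def stouffer_combined_p(z_scores):
    """Combine one-sided z scores from independent studies (Stouffer's method)."""
    k = len(z_scores)
    z_combined = sum(z_scores) / math.sqrt(k)
    # One-sided upper-tail p-value under the standard normal distribution.
    p_combined = 0.5 * math.erfc(z_combined / math.sqrt(2))
    return z_combined, p_combined

# Hypothetical data: ten studies, each individually non-significant
# (z = 1.0 corresponds to a one-sided p of about 0.16).
z_scores = [1.0] * 10
z, p = stouffer_combined_p(z_scores)
print(f"combined z = {z:.3f}, one-sided p = {p:.5f}")
```

Ten unremarkable studies pool into a combined z above 3, so the null hypothesis falls as a matter of arithmetic. The calculation says nothing about whether the persons or the studies were sampled from any defined population.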
It is common to acknowledge that many data sets fail to meet probabilistic sampling conditions, but to argue that one might well treat the data in hand "as if" it were a random sample of some hypothetical population. Under this supposition, inferential techniques are applied and the results inspected. The direction taken by the Handbook editors and authors mirrors the earliest published opinion on this problem, expressed by Mosteller and his colleagues in 1977: "One might expect that if our MEDLARS approach were perfect and produced all the papers we would have a census rather than a sample of the papers. To adopt this model would be to misunderstand our purpose. We think of a process producing these research studies through time, and we think of our sample--even if it were a census--as a sample in time from the process. Thus, our inference would still be to the general process, even if we did have all appropriate papers from a time period." (Gilbert, McPeek and Mosteller, 1977, p. 127; quoted in Cook et al., 1992, p. 291) This position is repeated in slightly different language by Hedges in Chapter 3, "Statistical Considerations": "The universe is the hypothetical collection of studies that could be conducted in principle and about which we wish to generalize. The study sample is the ensemble of studies that are used in the review and that provide the effect size data used in the research synthesis." (p. 30)
These notions appear to be circular. If the sample is fixed and the population is allowed to be hypothetical, then surely the data analyst will imagine a population that resembles the sample of data. Or as Gilbert, McPeek and Mosteller viewed it, the future will resemble the past if the past is all one has to go on. Hence all of these "hypothetical populations" will be merely reflections of the samples in hand and there will be no need for inferential statistics. Or put another way, if the population of inference is not defined by considerations separate from the characterization of the sample, then the population is merely a large version of the sample. With what confidence is one able to generalize the character of this sample to a population that looks like the sample writ large? Well, with a great deal of confidence, obviously. But then, the population is nothing but the sample.
Hedges and Olkin have developed inferential techniques that ignore the pro forma testing (because of large N) of null hypotheses and focus on the estimation of regression functions that estimate effects at different levels of study characteristics; nearly all of them appear in the Handbook. They worry about both sources of statistical instability: that arising from persons within studies and that which arises from variation between studies. As they properly point out, the study based on 500 persons deserves greater weight than the study based on 5 persons in determining the response of the treatment condition to changes in study conditions. The techniques they present are based on traditional assumptions of random sampling and independence. It is, of course, unclear precisely how the validity of their methods is compromised by failure to achieve probabilistic sampling of persons and studies.
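Weighting by study size is conventionally done with inverse-variance weights. The sketch below assumes the usual large-sample variance approximation for a standardized mean difference and two hypothetical studies; the function names are illustrative and do not come from the Handbook.

```python
def smd_variance(d, n_treat, n_ctrl):
    """Approximate large-sample variance of a standardized mean difference d."""
    n_total = n_treat + n_ctrl
    return n_total / (n_treat * n_ctrl) + d * d / (2 * n_total)

def weighted_mean_effect(studies):
    """Return the inverse-variance weighted mean effect and the weights used."""
    weights = [1.0 / smd_variance(d, nt, nc) for d, nt, nc in studies]
    mean = sum(w * d for w, (d, _, _) in zip(weights, studies)) / sum(weights)
    return mean, weights

# Hypothetical studies: (effect size d, treatment n, control n).
studies = [
    (0.80, 3, 2),      # 5 persons in total
    (0.20, 250, 250),  # 500 persons in total
]
mean_d, weights = weighted_mean_effect(studies)
print(f"weights = {weights[0]:.2f}, {weights[1]:.2f}; weighted mean d = {mean_d:.3f}")
```

The larger study receives roughly a hundred times the weight of the smaller one, so the pooled estimate sits almost on top of its effect size. The variance formula itself presumes the random sampling and independence whose absence is at issue here.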
The irony of traditional hypothesis testing approaches applied to meta-analysis is that whereas consideration of sampling error at the level of persons always leads to a pro forma rejection of "null hypotheses" (of zero correlation or zero average effect size), consideration of sampling error at the level of study characteristics (the study, not the person, as the unit of analysis) leads to too few rejections (too many Type II errors, one might say). Hedges's homogeneity test of the hypothesis that all studies in a group estimate the same population parameter is the focus of much attention in the Handbook. Once a hypothesis of homogeneity is accepted by Hedges's test, one is advised to treat all studies within the ensemble as the same. Experienced data analysts know, however, that there is typically a good deal of meaningful covariation between study characteristics and study findings even within ensembles where Hedges's test cannot reject the homogeneity hypothesis. The situation is nearly exactly parallel to the experience of psychometricians discovering that they could easily interpret several more factors than inferential solutions (maximum-likelihood; LISREL) could confirm. The best data exploration and discovery is more complex and credible than the most exact inferential test. In short, classical statistics seems not able to reproduce the complex cognitive processes that are commonly applied by data analysts.
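A small sketch makes the point about homogeneity testing concrete. It assumes the usual form of the homogeneity statistic, Q = sum of w_i (d_i - d_bar)^2 with inverse-variance weights w_i, referred to a chi-square distribution with k - 1 degrees of freedom. The six effect sizes, their common variance and the study characteristic are all hypothetical.

```python
import math

def homogeneity_q(effects, variances):
    """Return the homogeneity statistic Q and its degrees of freedom (k - 1)."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    q = sum(w * (d - pooled) ** 2 for w, d in zip(weights, effects))
    return q, len(effects) - 1

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical ensemble of six studies.
effects = [0.12, 0.15, 0.30, 0.31, 0.40, 0.52]   # effect sizes d_i
variances = [0.04] * 6                           # sampling variance of each d_i
characteristic = [1, 2, 3, 4, 5, 6]              # e.g. a coded study feature

CHI2_CRIT_DF5 = 11.07  # 0.05 critical value of chi-square with 5 df

q, df = homogeneity_q(effects, variances)
r = pearson_r(characteristic, effects)
print(f"Q = {q:.3f} on {df} df (critical value {CHI2_CRIT_DF5}); r = {r:.3f}")
```

Here Q falls far short of the critical value, so the ensemble would be declared homogeneous, yet the effect sizes rise almost monotonically with the coded study characteristic. The data are contrived to make the contrast plain, but the pattern is the one described above.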
Rubin (1990) addressed most of these issues squarely and staked out a radical position that appeals to the author of this review: "...consider the idea that sampling and representativeness of the studies in a meta-analysis are important. I will claim that this is nonsense--we don't have to worry about representing a population but rather about other far more important things." (p. 155) These more important things to Rubin are the estimation of treatment effects under a set of standard or ideal study conditions. This process, as he outlined it, involves the fitting of response surfaces (a form of quantitative model building) between study effects (Y) and study conditions (X, W, Z etc.). Of the 32 chapters in the Handbook, only the contribution of Light, Singer and Willett, Chapter 28, "The Visual Presentation and Interpretation of Meta-Analyses," comes close to illustrating what Rubin has in mind. By far most meta-analyses are undertaken in pursuit not of scientific theory but technological evaluation. The evaluation question is never whether some hypothesis or model is accepted or rejected but rather how "outputs" or "benefits" or "effect sizes" vary from one set of circumstances to another; and the meta-analyst rarely works on a collection of data that can sensibly be described as a probability sample from anything.
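The response-surface idea can be sketched in its simplest form: a weighted straight-line fit of study effects (Y) on a single study condition (X), read off at the value of X that an ideal study would have. This is a minimal illustration under assumed data, not Rubin's own procedure; the attrition rates, effect sizes and weights are hypothetical, and a real surface would involve several conditions at once.

```python
def weighted_line_fit(xs, ys, weights):
    """Weighted least-squares fit of y = intercept + slope * x."""
    total_w = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total_w
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total_w
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical studies: attrition rate (X), effect size (Y), precision weight.
attrition = [0.10, 0.20, 0.30, 0.40]
effects = [0.50, 0.45, 0.40, 0.35]
weights = [50.0, 100.0, 200.0, 100.0]

intercept, slope = weighted_line_fit(attrition, effects, weights)

IDEAL_ATTRITION = 0.0  # the "ideal study" condition, an assumption of this sketch
effect_at_ideal = intercept + slope * IDEAL_ATTRITION
print(f"slope = {slope:.3f}; estimated effect at ideal conditions = {effect_at_ideal:.3f}")
```

The quantity of interest is the fitted effect at the ideal condition, not a test of whether the studies represent some population. No probability statement about sampling is needed to read it off.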
Rubin's view of the meta-analysis enterprise would have produced a volume substantially different from that which Cooper and Hedges edited. So we can expect the Handbook of research synthesis to be not the last word on the subject, but one important word on meta-analysis.
References
Glass, G.V; McGaw, B. & Smith, M.L. (1981). Meta-Analysis in Social Research. Beverly Hills, CA: SAGE.
Rosenthal, R. (1984). Meta-Analytic Procedures for Social Research. Beverly Hills, CA: SAGE.
Rubin, D.B. (1990). A new perspective. Chp. 14 (pp. 155-165) in Wachter, K.W. and Straf, M.L. (Eds.), The Future of Meta-Analysis. New York: Russell Sage Foundation.
Smith, M.L.; Glass, G.V & Miller, T.I. (1980). Benefits of Psychotherapy. Baltimore, MD: Johns Hopkins University Press.
Gilbert, J.P.; McPeek, B. & Mosteller, F. (1977). Progress in surgery and anesthesia: benefits and risks of innovative surgery. In J. P. Bunker, B.A. Barnes & F. Mosteller (eds.) (1977). Costs, Risks and Benefits of Surgery. NY: Oxford University Press.
Cook, T.D.; Cooper, H; Cordray, D.S.; Hartmann, H; Hedges, L.V.; Light, R.J.; Louis, T.A.; & Mosteller, F. (1992). Meta-analysis for explanation: A casebook. New York: Russell Sage Foundation.
Meehl, P.E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66, 195-244. (Monograph Supplement 1-V66)