This chapter was written by Samuel Ball and originally published in 1979 by Educational Testing Service and later posthumously in 2011 as a research report in the ETS R&D Scientific and Policy Contributions Series. Ball was one of ETS’s most active program evaluators for 10 years and directed several pacesetting studies, including a large-scale evaluation of the educational effects of Sesame Street. The chapter documents the vigorous program of evaluation research conducted at ETS in the 1960s and 1970s, which helped lay the foundation for what was then a fledgling field. This work developed new viewpoints, techniques, and skills for systematically assessing educational programs and led to the creation of principles for program evaluation that still appear relevant today.
Evaluating educational programs is an emerging profession, and Educational Testing Service (ETS) has played an active role in its development. The term program evaluation only came into wide use in the mid-1960s, when efforts at systematically assessing programs multiplied. The purpose of this kind of evaluation is to provide information to decision makers who have responsibility for existing or proposed educational programs. For instance, program evaluation may be used to help make decisions concerning whether to develop a program (needs assessment), how best to develop a program (formative evaluation), and whether to modify—or even continue—an existing program (summative evaluation).
Needs assessment is the process by which one identifies needs and decides upon priorities among them. Formative evaluation refers to the process involved when the evaluator helps the program developer—by pretesting program materials, for example. Summative evaluation is the evaluation of the program after it is in operation. Arguments are rife among program evaluators about what kinds of information should be provided in each of these forms of evaluation.
In general, the ETS posture has been to try to obtain the best—that is, the most relevant, valid, and reliable—information that can be obtained within the constraints of cost and time and the needs of the various audiences for the evaluation. Sometimes, this means a tight experimental design with a national sample; at other times, the best information might be obtained through an intensive case study of a single institution. ETS has carried out both traditional and innovative evaluations of both traditional and innovative programs, and staff members also have cooperated with other institutions in planning or executing some aspects of evaluation studies. Along the way, the work by ETS has helped to develop new viewpoints, techniques, and skills.
Program evaluation calls for a wide range of skills, and evaluators come from a variety of disciplines: educational psychology, developmental psychology, psychometrics, sociology, statistics, anthropology, educational administration, and a host of subject matter areas. As program evaluation began to emerge as a professional concern, ETS changed, both structurally and functionally, to accommodate it. The structural changes were not exclusively tuned to the needs of conducting program evaluations. Rather, program evaluation, like the teaching of English in a well-run high school, became to some degree the concern of virtually all the professional staff. Thus, new research groups were added, and they augmented the organization’s capability to conduct program evaluations.
The functional response was many-faceted. Two of the earliest evaluation studies conducted by ETS indicate the breadth of the range of interest. In 1965, collaborating with the Pennsylvania State Department of Education, Henry Dyer of ETS set out to establish a set of educational goals against which later the performance of the state’s educational system could be evaluated (Dyer 1965a, b). A unique aspect of this endeavor was Dyer’s insistence that the goal-setting process be opened up to strong participation by the state’s citizens and not left solely to a professional or political elite. (In fact, ETS program evaluation has been marked by a strong emphasis, when at all appropriate, on obtaining community participation.)
The other early evaluation study in which ETS was involved was the now famous Coleman report (Equality of Educational Opportunity), issued in 1966 (Coleman et al. 1966). ETS staff, under the direction of Albert E. Beaton, had major responsibility for analysis of the massive data generated (see Beaton and Barone, Chap. 8, this volume). Until then, studies of the effectiveness of the nation’s schools, especially with respect to programs’ educational impact on minorities, had been small-scale. So the collection and analysis of data concerning tens of thousands of students and hundreds of schools and their communities were new experiences for ETS and for the profession of program evaluation.
In the intervening years, the Coleman report (Coleman et al. 1966) and the Pennsylvania Goals Study (Dyer 1965a, b) have become classics of their kind, and from these two auspicious early efforts, ETS has become a center of major program evaluation. Areas of focus include computer-aided instruction, aesthetics and creativity in education, educational television, educational programs for prison inmates, reading programs, camping programs, career education, bilingual education, higher education, preschool programs, special education, and drug programs. (For brief descriptions of ETS work in these areas, as well as for studies that developed relevant measures, see the appendix.) ETS also has evaluated programs relating to year-round schooling, English as a second language, desegregation, performance contracting, women’s education, busing, Title I of the Elementary and Secondary Education Act (ESEA), accountability, and basic information systems.
One piece of work that must be mentioned is the Encyclopedia of Educational Evaluation, edited by Anderson et al. (1975). Subtitled Concepts and Techniques for Evaluating Education and Training Programs, the encyclopedia contains 141 articles in all, written by the editors and 36 other members of the ETS staff.
Given the innovativeness of many of the programs evaluated, the newness of the profession of program evaluation, and the level of expertise of the ETS staff who have directed these studies, it is not surprising that the evaluations themselves have been marked by innovations for the profession of program evaluation. At the same time, ETS has adopted several principles relative to each aspect of program evaluation. It will be useful to examine these innovations and principles in terms of the phases that a program evaluation usually attends to—goal setting, measurement selection, implementation in the field setting, analysis, and interpretation and presentation of evidence.
It would be a pleasure to report that virtually every educational program has a well-thought-through set of goals, but it is not so. It is, therefore, necessary at times for program evaluators to help verbalize and clarify the goals of a program to ensure that they are, at least, explicit. Further, the evaluator may even be given goal development as a primary task, as in the Pennsylvania Goals Study (Dyer 1965a, b). This need was seen again in a similar project, when Robert Feldmesser (1973) helped the New Jersey State Board of Education establish goals that conceptually underwrite that state’s “thorough and efficient” education program.
Work by ETS staff indicates there are four important principles with respect to program goal development and explication. The first of these principles is as follows: What program developers say their program goals are may bear only a passing resemblance to what the program in fact seems to be doing.
This principle—the occasional surrealistic quality of program goals—has been noted on a number of occasions: For example, assessment instruments developed for a program evaluation on the basis of the stated goals sometimes do not seem at all sensitive to the actual curriculum. As a result, ETS program evaluators seek, whenever possible, to cooperate with program developers to help fashion the goals statement. The evaluators also will attempt to describe the program in operation and relate that description to the stated goals, as in the case of the 1971 evaluation of the second year of Sesame Street for Children’s Television Workshop (Bogatz and Ball 1971). This comparison is an important part of the process and sometimes represents crucial information for decision makers concerned with developing or modifying a program.
The second principle is as follows: When program evaluators work cooperatively with developers in making program goals explicit, both the program and the evaluation seem to benefit.
The original Sesame Street evaluation (Ball and Bogatz 1970) exemplified the usefulness of this cooperation. At the earliest planning sessions for the program, before it had a name and before it was fully funded, the developers, aided by ETS, hammered out the program goals. Thus, ETS was able to learn at the outset what the program developers had in mind, ensuring sufficient time to provide adequately developed measurement instruments. If the evaluation team had had to wait until the program itself was developed, there would not have been sufficient time to develop the instruments; more important, the evaluators might not have had sufficient understanding of the intended goals—thereby making sensible evaluation unlikely.
The third principle is as follows: There is often a great deal of empirical research to be conducted before program goals can be specified.
Sometimes, even before goals can be established or a program developed, it is necessary, through empirical research, to indicate that there is a need for the program. An illustration is provided by the research of Ruth Ekstrom and Marlaine Lockheed (1976) into the competencies gained by women through volunteer work and homemaking. The ETS researchers argued that it is desirable for women to resume their education if they wish to after years of absence. But what competencies have they picked up in the interim that might be worthy of academic credit? By identifying, surveying, and interviewing women who wished to return to formal education, Ekstrom and Lockheed established that many women had indeed learned valuable skills and knowledge. Colleges were alerted and some have begun to give credit where credit is due.
Similarly, when the federal government decided to make a concerted attack on the reading problem as it affects the total population, one area of concern was adult reading. But there was little knowledge about it. Was there an adult literacy problem? Could adults read with sufficient understanding such items as newspaper employment advertisements, shopping and movie advertisements, and bus schedules? And in investigating adult literacy, what characterized the reading tasks that should be taken into account? Murphy, in a 1973 study (Murphy 1973a), considered these factors: the importance of a task (the need to be able to read the material even if only once a year, as with income tax forms and instructions), the intensity of the task (a person who wants to work in the shipping department will have to read the shipping schedule each day), and the extensivity of the task (70% of the adult population read a newspaper, but it can usually be ignored without gross problems arising). Murphy and other ETS researchers conducted surveys of reading habits and abilities, and this assessment of needs provided the government with information needed to decide on goals and develop appropriate programs.
Still a different kind of needs assessment was conducted by ETS researchers with respect to a school for learning-disabled students in 1976 (Ball and Goldman 1976). The school catered to children aged 5–18 and had four separate programs and sites. ETS first served as a catalyst, helping the school’s staff develop a listing of problems. Then ETS acted as an amicus curiae, drawing attention to those problems, making explicit and public what might have been unsaid for want of an appropriate forum. Solving these problems was the purpose of stating new institutional goals—goals that might never have been formally recognized if ETS had not worked with the school to make its needs explicit.
The fourth principle is as follows: The program evaluator should be conscious of and interested in the unintended outcomes of programs as well as the intended outcomes specified in the program’s goal statement.
In program evaluation, the importance of looking for side effects, especially negative ones, has to be considered against the need to put a major effort into assessing progress toward intended outcomes. Often, in this phase of evaluation, the varying interests of evaluators, developers, and funders intersect—and professional, financial, and political considerations are all at odds. At such times, program evaluation becomes as much an art form as an exercise in social science.
A number of articles were written about this problem by Samuel J. Messick, ETS vice president for research (e.g., Messick 1970, 1975). His viewpoint—the importance of the medical model—has been illustrated in various ETS evaluation studies. His major thesis was that the medical model of program evaluation explicitly recognizes that “…prescriptions for treatment and the evaluation of their effectiveness should take into account not only reported symptoms but other characteristics of the organism and its ecology as well” (Messick 1975, p. 245). As Messick went on to point out, this characterization was a call for a systems analysis approach to program evaluation—dealing empirically with the interrelatedness of all the factors and monitoring all outcomes, not just the intended ones.
When, for example, ETS evaluated the first 2 years of Sesame Street (Ball and Bogatz 1970), there was obviously pressure to ascertain whether the intended goals of that show were being attained. It was nonetheless possible to look for some of the more likely unintended outcomes: whether the show had negative effects on heavy viewers going off to kindergarten, and whether the show was achieving impacts in attitudinal areas.
In summative evaluations, studying unintended outcomes is bound to cost more than ignoring them, and it is often difficult to secure increased funding for this purpose. For educational programs with potential national applications, however, ETS strongly supports this more comprehensive approach.
The letters ETS have become almost synonymous in some circles with standardized testing of student achievement. In its program evaluations, ETS naturally uses such tests as appropriate, but frequently the standardized tests are not appropriate measures. In some evaluations, ETS uses both standardized and domain-referenced tests. An example may be seen in The Electric Company evaluations (Ball et al. 1974). This televised series, which was intended to teach reading skills to first through fourth graders, was evaluated in some 600 classrooms. One question that was asked during the process concerned the interaction of the student’s level of reading attainment and the effectiveness of viewing the series. Do good readers learn more from the series than poor readers? So standardized, norm-referenced reading tests were administered, and the students in each grade were divided into deciles on this basis, thereby yielding ten levels of reading attainment.
Data on the outcomes using the domain-referenced tests were subsequently analyzed for each decile ranking. Thus, ETS was able to specify for what level of reading attainment, in each grade, the series was working best. This kind of conclusion would not have been possible if a specially designed domain-referenced reading test with no external referent had been the only one used, nor if a standardized test, not sensitive to the program’s impact, had been the only one used.
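To make the decile procedure concrete, here is a minimal sketch, in Python, of how standardized pretest scores might be cut into within-grade deciles and the domain-referenced results summarized for each decile. The data are simulated and the column names are illustrative assumptions; this is not the analysis code used in the original study.

```python
import numpy as np
import pandas as pd

# Simulated stand-in data; column names are illustrative, not those of the study.
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "grade": rng.integers(1, 5, size=n),          # grades 1 through 4
    "std_pretest": rng.normal(50, 10, size=n),    # norm-referenced reading score
})
# Fake outcome: gain on the goal-specific (domain-referenced) reading test.
df["dr_gain"] = 5 + 0.1 * df["std_pretest"] + rng.normal(0, 3, size=n)

# Within each grade, cut standardized scores into deciles (1 = lowest tenth).
df["decile"] = df.groupby("grade")["std_pretest"].transform(
    lambda s: pd.qcut(s, 10, labels=False) + 1
)

# Mean domain-referenced gain per grade and decile: where did the series work best?
print(df.groupby(["grade", "decile"])["dr_gain"].agg(["mean", "count"]))
```

The point of the sketch is the pairing: the norm-referenced score supplies the external referent for ranking students, while the domain-referenced score supplies the program-sensitive outcome summarized within each rank.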
Without denying the usefulness of previously designed and developed measures, ETS evaluators have frequently preferred to develop or adapt instruments that would be specifically sensitive to the tasks at hand. Sometimes this measurement effort is carried out in anticipation of the needs of program evaluators for a particular instrument, and sometimes because a current program evaluation requires immediate instrumentation.
An example of the former is a study of doctoral programs by Mary Jo Clark et al. (1976). Existing instruments had been based on surveys in which practitioners in a given discipline were asked to rate the quality of doctoral programs in that discipline. Instead of this reputational survey approach, the ETS team developed an array of criteria (e.g., faculty quality, student body quality, resources, academic offerings, alumni performance), all open to objective assessment. This tool can be used to assess changes in the quality of the doctoral programs offered by major universities.
Similarly, the development by ETS of the Kit of Factor-Referenced Cognitive Tests (Ekstrom et al. 1976) also provided a tool—one that could be used when evaluating the cognitive abilities of teachers or students if these structures were of interest in a particular evaluation. A clearly useful application was in the California study of teaching performance by Frederick McDonald and Patricia Elias (1976). Teachers with certain kinds of cognitive structures were seen to have differential impacts on student achievement. In the Donald A. Trismen study of an aesthetics program (Trismen 1968), the factor kit was used to see whether cognitive structures interacted with aesthetic judgments.
Examples of the development of specific instrumentation for ETS program evaluations are numerous. Virtually every program evaluation involves, at the very least, some adapting of existing instruments. For example, a questionnaire or interview may be adapted from ones developed for earlier studies. Typically, however, new instruments, including goal-specific tests, are prepared. Some ingenious examples, based on the 1966 work of E. J. Webb, D. T. Campbell, R. D. Schwartz, and L. Sechrest, were suggested by Anderson (1968) for evaluating museum programs, and the title of her article gives a flavor of the unobtrusive measures illustrated—“Noseprints on the Glass.”
Another example of ingenuity is Trismen’s use of 35 mm slides as stimuli in the assessment battery of the Education through Vision program (Trismen 1968). Each slide presented an art masterpiece, and the response options were four abstract designs varying in color. The instruction to the student was to pick the design that best illustrated the masterpiece’s coloring.
When ETS evaluators have to assess a variable and the usual measures have rather high levels of error inherent in them, they usually resort to triangulation. That is, they use multiple measures of the same construct, knowing that each measure suffers from a specific weakness. Thus, in 1975, Donald E. Powers evaluated for the Philadelphia school system the impact of dual-audio television—a television show telecast while a designated FM radio station provided an appropriate educational commentary. One problem in measurement was assessing the amount of contact the student had with the dual-audio television treatment (Powers 1975a). Powers used home telephone interviews, student questionnaires, and very simple knowledge tests of the characters in the shows to assess whether students had in fact been exposed to the treatment. Each of these three measures has problems associated with it, but the combination provided a useful assessment index.
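A minimal sketch of such triangulation, on simulated data, is to standardize each imperfect measure and average the results into one composite exposure index. The measure names below are hypothetical stand-ins for the telephone interviews, questionnaires, and character-knowledge tests, not the actual instruments or scoring rules used by Powers.

```python
import numpy as np
import pandas as pd

# Three noisy indicators of the same construct (treatment exposure), simulated here.
rng = np.random.default_rng(1)
n = 200
data = pd.DataFrame({
    "phone_interview": rng.integers(0, 5, n),   # reported viewing, 0-4 scale
    "questionnaire": rng.integers(0, 5, n),     # self-reported viewing, 0-4 scale
    "character_test": rng.integers(0, 11, n),   # character-knowledge items correct, 0-10
})

# Standardize each measure so no single scale dominates, then average into one index.
z = (data - data.mean()) / data.std(ddof=0)
data["exposure_index"] = z.mean(axis=1)
print(data.head())
```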
In some circumstances, ETS evaluators are able to develop measurement techniques that are an integral part of the treatment itself. This unobtrusiveness has clear benefits and is most readily attainable with computer-aided instructional (CAI) programs. Thus, for example, Donald L. Alderman, in the evaluation of TICCIT (a CAI program developed by the Mitre Corporation), obtained for each student such indices as the number of lessons passed, the time spent on line, the number of errors made, and the kinds of errors (Alderman 1978). And he did this simply by programming the computer to save this information over given periods of time.
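The bookkeeping involved is simple enough to sketch. The record fields and roll-ups below are assumptions for illustration, not TICCIT’s actual data model: each interaction is logged, and per-student indices (lessons passed, time on line, error counts by kind) are aggregated from the log.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class LogEvent:
    student_id: str
    lesson_id: str
    seconds: int              # time spent on this attempt
    passed: bool              # did the attempt pass the lesson?
    error_kind: str | None    # None if no error was recorded

def summarize(events: list[LogEvent]) -> dict[str, dict]:
    """Roll log events up into per-student usage indices."""
    summary: dict[str, dict] = defaultdict(
        lambda: {"lessons_passed": set(), "seconds_online": 0, "errors": defaultdict(int)}
    )
    for e in events:
        s = summary[e.student_id]
        s["seconds_online"] += e.seconds
        if e.passed:
            s["lessons_passed"].add(e.lesson_id)
        if e.error_kind is not None:
            s["errors"][e.error_kind] += 1
    return summary

events = [
    LogEvent("s1", "L01", 300, True, None),
    LogEvent("s1", "L02", 450, False, "syntax"),
    LogEvent("s2", "L01", 280, True, "careless"),
]
for student, s in summarize(events).items():
    print(student, len(s["lessons_passed"]), s["seconds_online"], dict(s["errors"]))
```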
Measurement problems cannot be addressed satisfactorily if the setting in which the measures are to be administered is ignored. One of the clear lessons learned in ETS program evaluation studies is that measurement in field settings (home, school, community) poses different problems from measurement conducted in a laboratory.
Program evaluation, whether formative or summative, demands that its empirical elements usually be conducted in natural field settings rather than in more contrived settings, such as a laboratory. Nonetheless, the problems of working in field settings are rarely systematically discussed or researched. In an article in the Encyclopedia of Educational Evaluation, Bogatz (1975) detailed the major aspects of conducting evaluations in field settings.
Of course, all the aspects discussed by Bogatz interact with the measurement and design of the program evaluation. A great source of information concerning field operations is the ETS Head Start Longitudinal Study of Disadvantaged Children, directed by Virginia Shipman (1970). Although not primarily a program evaluation, it certainly has generated implications for early childhood programs. It was longitudinal, comprehensive in scope, and large in size, encompassing four sites and, initially, some 2000 preschoolers. It was clear from the outset that close community ties were essential if only for expediency—although, of course, more important ethical principles were involved. This close relationship with the communities in which the study was conducted involved using local residents as supervisors and testers, establishing local advisory committees, and thus ensuring free, two-way communication between the research team and the community.
The Sesame Street evaluation also adopted this approach (Ball and Bogatz 1970). In part because of time pressures and in part to ensure valid test results, the ETS evaluators developed the tests specifically so that community members with minimal educational attainments could be trained quickly to administer them with proper skill.
In evaluations of street academies by Ronald L. Flaugher (1971), and of education programs in prisons by Flaugher and Samuel Barnett (1972), it was argued that one of the most important elements in successful field relationships is the time an evaluator spends getting to know the interests and concerns of various groups, and lowering barriers of suspicion that frequently separate the educated evaluator and the less-educated program participants. This point may not seem particularly sophisticated or complex, but many program evaluations have floundered because of an evaluator’s lack of regard for disadvantaged communities (Anderson 1970). Therefore, a firm principle underlying ETS program evaluation is to be concerned with the communities that provide the contexts for the programs being evaluated. Establishing two-way lines of communication with these communities and using community resources whenever possible help ensure a valid evaluation.
Even with the best possible community support, field settings cause problems for measurement. Raymond G. Wasdyke and Jerilee Grandy (1976) showed this idea to be true in an evaluation in which the field setting was literally that—a field setting. In studying the impact of a camping program on New York City grade school pupils, they recognized the need, common to most evaluations, to describe the treatment—in this case the camping experience. Therefore, ETS sent an observer to the campsite with the treatment groups. This person, who was herself skilled in camping, managed not to be an obtrusive participant by maintaining a relatively low profile.
Of course, the problems of the observer can be just as difficult in formal institutions as on the campground. In their 1974 evaluation of Open University materials, Hartnett and colleagues found, as have program evaluators in almost every situation, that there was some defensiveness in each of the institutions in which they worked (Hartnett et al. 1974). Both personal and professional contacts were used to allay suspicions. There also was emphasis on an evaluation design that took into account each institution’s values. That is, part of the evaluation was specific to the institution, but some common elements across institutions were retained. This strategy underscored the evaluators’ realization that each institution was different, but allowed ETS to study certain variables across all three participating institutions.
Breaking down the barriers in a field setting is one of the important elements of a successful evaluation, yet each situation demands somewhat different evaluator responses.
Another way of ensuring that evaluation field staff are accepted by program staff is to make the program staff active participants in the evaluation process. While this integration is obviously a technique to be strongly recommended in formative evaluations, it can also be used in summative evaluations. In his evaluation of PLATO in junior colleges, Murphy (1977) could not afford to become the victim of a program developer’s fear of an insensitive evaluator. He overcame this potential problem by enlisting the active participation of the junior college and program development staffs. One of Murphy’s concerns was that there was no common course across colleges. Introduction to Psychology, for example, might be taught virtually everywhere, but the content can change remarkably, depending on such factors as who teaches the course, where it is taught, and what text is used. Murphy understood this variability, and his evaluation of PLATO reflected his concern. It also necessitated considerable input and cooperation from program developers and college teachers working in concert—with Murphy acting as the conductor.
Once the principles and strategies used by program evaluators in their field operations have succeeded and data have been obtained, there remains the important phase of data analysis. In practice, of course, the program evaluator thinks through the question of data analysis before entering the data collection phase. Plans for analysis help determine what measures to develop, what data to collect, and even, to some extent, how the field operation is to be conducted. Nonetheless, analysis plans drawn up early in the program evaluation cannot remain quite as immutable as the Mosaic Law. To illustrate the need for flexibility, it is useful to turn once again to the heuristic ETS evaluation of Sesame Street.
As initially planned, the design of the Sesame Street evaluation was a true experiment (Ball and Bogatz 1970). The analyses called for were multivariate analyses of covariance, using pretest scores as the covariate. At each site, a pool of eligible preschoolers was obtained by community census, and experimental and control groups were formed by random assignment from these pools. The evaluators were somewhat concerned that those designated to be the experimental (viewing) group might not view the show—it was a new show on public television, a loose network of TV stations not noted for high viewership. Some members of the Sesame Street national research advisory committee counseled ETS to consider paying the experimental group to view. The suggestion was resisted, however, because any efforts above mild and occasional verbal encouragement to view the show would compromise the results. If the experimental group members were paid, and if they then viewed extensively and outperformed the control group at posttest, would the improved performance be due to the viewing, the payment, or some interaction of payment and viewing? Of course, this nice argument proved to be not much more than an exercise in modern scholasticism. In fact, the problem lay not in the treatment group but in the uninformed and unencouraged-to-view control group. The members of that group, as indeed preschoolers with access to public television throughout the nation, were viewing the show with considerable frequency—and not much less than the experimental group. Thus, the planned analysis involving differences in posttest attainments between the two groups was dealt a mortal blow.
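For readers who want to picture the planned analysis, here is a minimal single-outcome sketch on simulated data: posttest scores regressed on viewing-group membership with the pretest as covariate. The actual plan called for multivariate analyses of covariance across several outcome batteries, so this illustrates only the general form, and all names and values below are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated pretest/posttest data for randomly assigned viewing (1) and control (0) groups.
rng = np.random.default_rng(42)
n = 400
pretest = rng.normal(20, 5, n)
viewing = rng.integers(0, 2, n)
posttest = 5 + 0.8 * pretest + 3.0 * viewing + rng.normal(0, 4, n)

df = pd.DataFrame({"pretest": pretest, "viewing": viewing, "posttest": posttest})

# Analysis of covariance as a regression: treatment effect adjusted for the pretest.
model = smf.ols("posttest ~ pretest + C(viewing)", data=df).fit()
print(model.summary().tables[1])
```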
Fortunately, other analyses were available, and the ETS-refined age cohorts design provided a rational basis for them. This design is presented in the relevant report (Ball and Bogatz 1970). The need here is not to describe the design and analysis but to emphasize a point made by the poet Robert Burns some time ago and repeated here more prosaically: The best-laid plans of evaluators can “gang aft agley,” too.
Sometimes program evaluators find that the design and analysis they have in mind represent an untrodden path. This result is perhaps in part because many of the designs in the social sciences are built upon laboratory conditions and simply are not particularly relevant to what happens in educational institutions.
When ETS designed the summative evaluation of The Electric Company , it was able to set up a true experiment in the schools. Pairs of comparable classrooms within a school and within a grade were designated as the pool with which to work. One of each pair of classes was randomly assigned to view the series. Pretest scores were used as covariates on posttest scores, and in 1973 the first-year evaluation analysis was successfully carried out (Ball and Bogatz 1973). The evaluation was continued through a second year, however, and as is usual in schools, the classes did not remain intact.
From an initial 200 classes, the children had scattered through many more classrooms. Virtually none of the classes with subject children contained only experimental or only control children from the previous year. Donald B. Rubin , an ETS statistician, consulted with a variety of authorities and found that the design and analysis problem for the second year of the evaluation had not been addressed in previous work. To summarize the solution decided on, the new pool of classes was reassigned randomly to E (experimental) or C (control) conditions so that over the 2 years the design was portrayable as Fig. 11.1.
Further, the pretest scores of Year II were usable as new covariates when analyzing the results of the Year II posttest scores (Ball et al. 1974).
Unfortunately for those who prefer routine procedures, it has been shown across a wide range of ETS program evaluations that each design and analysis must be tailored to the occasion. Thus, Gary Marco (1972), as part of the statewide educational assessment in Michigan, evaluated ESEA Title I program performance. He assessed the amount of exposure students had to various clusters of Title I programs, and he included control schools in the analysis. He found that a regression-analysis model involving a correction for measurement error was an innovative approach that best fit his complex configuration of data.
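The report does not specify Marco’s exact model, but one classical correction of this kind divides an observed regression slope by the reliability of the error-laden predictor (disattenuation). The sketch below illustrates that textbook device on simulated data; it should not be read as the model actually used in the Michigan assessment.

```python
import numpy as np

# Simulated true exposure, an error-laden measure of it, and an outcome driven by the true score.
rng = np.random.default_rng(7)
n = 1000
true_exposure = rng.normal(0, 1, n)
observed_exposure = true_exposure + rng.normal(0, 0.5, n)   # measurement error added
outcome = 2.0 * true_exposure + rng.normal(0, 1, n)

# In practice the reliability would come from test data; here we compute it from the simulation.
reliability = np.var(true_exposure, ddof=1) / np.var(observed_exposure, ddof=1)

naive_slope = np.cov(observed_exposure, outcome)[0, 1] / np.var(observed_exposure, ddof=1)
corrected_slope = naive_slope / reliability   # disattenuated estimate

print(f"naive: {naive_slope:.2f}  corrected: {corrected_slope:.2f}  true: 2.00")
```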
Garlie Forehand, Marjorie Ragosta, and Donald A. Rock, in a national, correlational study of desegregation, obtained data on school characteristics and on student outcomes (Forehand et al. 1976). The purposes of the study included defining indicators of effective desegregation and discriminating between more and less effective school desegregation programs. The emphasis throughout the effort was on variables that were manipulable. That is, the idea was that evaluators would be able to suggest practical advice on what schools can do to achieve a productive desegregation program. Initial investigations allowed specification among the myriad variables of a hypothesized set of causal relationships, and the use of path analysis made possible estimation of the strength of hypothesized causal relationships. On the basis of the initial correlation matrices, the path analyses, and the observations made during the study, an important product—a nontechnical handbook for use in schools—was developed.
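Path analysis of this sort can be carried out as a set of regressions on standardized variables, one per endogenous variable in the assumed causal ordering. The three-variable diagram and the variable names below are hypothetical, intended only to show the mechanics, not the model actually fitted in the desegregation study.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data for an assumed causal ordering: policy -> climate -> achievement.
rng = np.random.default_rng(3)
n = 300
policy = rng.normal(0, 1, n)                           # manipulable school characteristic
climate = 0.6 * policy + rng.normal(0, 1, n)           # intervening school condition
achievement = 0.3 * policy + 0.5 * climate + rng.normal(0, 1, n)

df = pd.DataFrame({"policy": policy, "climate": climate, "achievement": achievement})
z = (df - df.mean()) / df.std(ddof=0)                  # standardize to get path coefficients

# One regression per endogenous variable; standardized slopes are the path coefficients.
p_climate = smf.ols("climate ~ policy", data=z).fit()
p_achieve = smf.ols("achievement ~ policy + climate", data=z).fit()

print("policy -> climate:      ", round(p_climate.params["policy"], 2))
print("policy -> achievement:  ", round(p_achieve.params["policy"], 2))
print("climate -> achievement: ", round(p_achieve.params["climate"], 2))
```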
Another large-scale ETS evaluation effort was directed by Trismen et al. (1976). They studied compensatory reading programs, initially surveying more than 700 schools across the country. Over a 4-year period ending in 1976, this evaluation interspersed data analysis with new data collection efforts. One purpose was to find schools that provided exceptionally positive or negative program results. These schools were visited blind and observed by ETS staff. Whereas the Forehand evaluation analysis (Forehand et al. 1976) was geared to obtaining practical applications, the equally extensive evaluation analysis of Trismen’s study was aimed at generating hypotheses to be tested in a series of smaller experiments.
As a further illustration of the complex interrelationship among evaluation purposes, design, analyses, and products, there is the 1977 evaluation of the use of PLATO in the elementary school by Spencer Swinton and Marianne Amarel (1978). They used a form of regression analysis—as did Forehand et al. (1976) and Trismen et al. (1976). But here the regression analyses were used differently in order to identify program effects unconfounded by teacher differences. In this regression analysis, teachers became fixed effects, and contrasts were fitted for each within-teacher pair (experimental versus control classroom teachers).
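A minimal sketch of that fixed-effects idea, on simulated data with one experimental and one control classroom per teacher (an assumed structure, not necessarily the study’s), looks like this: with a dummy variable for every teacher, the PLATO coefficient is estimated entirely from within-teacher contrasts, so stable teacher differences cannot masquerade as a program effect.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated class-level data: each teacher contributes one PLATO and one control classroom.
rng = np.random.default_rng(11)
rows = []
for t in [f"t{i:02d}" for i in range(20)]:
    teacher_effect = rng.normal(0, 2)                 # stable teacher difference
    for plato in (0, 1):                              # 0 = control class, 1 = PLATO class
        class_mean = 50 + teacher_effect + 2.5 * plato + rng.normal(0, 1)
        rows.append({"teacher": t, "plato": plato, "class_mean_score": class_mean})

df = pd.DataFrame(rows)

# Teacher dummies absorb teacher differences; the plato coefficient is the within-teacher contrast.
model = smf.ols("class_mean_score ~ plato + C(teacher)", data=df).fit()
print(round(model.params["plato"], 2))
```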
This design, in turn, provides a contrast to McDonald’s (1977) evaluation of West New York programs to teach English as a second language to adults. In this instance, the regression analysis was directed toward showing which teaching method related most to gains in adult students’ performance.
There is a school of thought within the evaluation profession that design and analysis in program evaluation can be made routine. At this point, the experience of ETS indicates that this would be unwise.
Possibly the most important principle in program evaluation is that interpretations of the evaluation’s meaning—the conclusions to be drawn—are often open to various nuances. Another problem is that the evidence on which the interpretations are based may be inconsistent. The initial premise of this chapter was that the role of program evaluation is to provide evidence for decision-makers. Thus, one could argue that differences in interpretation, and inconsistencies in the evidence, are simply problems for the decision-maker and not for the evaluator.
But consider, for example, an evaluation by Powers of a year-round program in a school district in Virginia (Powers 1974, 1975b). (The long vacation was staggered around the year so that schools remained open in the summer.) The evidence presented by Powers indicated that the year-round school program provided a better utilization of physical plant and that student performance was not negatively affected. The school board considered this evidence as well as other conflicting evidence provided by Powers that the parents’ attitudes were decidedly negative. The board made up its mind, and (not surprisingly) scotched the program. Clearly, however, the decision was not up to Powers. His role was to collect the evidence and present it systematically.
In general, the ETS response to conflicting evidence or varieties of nuances in interpretation is to keep the evaluation process and its reporting as open as possible. In this way, the values of the evaluator, though necessarily present, are less likely to be a predominating influence on subsequent action.
Program evaluators do, at times, have the opportunity to influence decision-makers by showing them that there are kinds of evidence not typically considered. The Coleman Study, for example, showed at least some decision-makers that there is more to evaluating school programs than counting (or calculating) the numbers of books in libraries, the amount of classroom space per student, the student-teacher ratio, and the availability of audiovisual equipment (Coleman et al. 1966). Rather, the output of the schools in terms of student performance was shown to be generally superior evidence of school program effectiveness.
Through their work, evaluators are also able to educate decision makers to consider the important principle that educational treatments may have positive effects for some students and negative effects for others—that an interaction of treatment with student should be looked for. As pointed out in the discussion of unintended outcomes, a systems-analysis approach to program evaluation—dealing empirically with the interrelatedness of all the factors that may affect performance—is to be preferred. And this approach, as Messick emphasized, “properly takes into account those student-process-environment interactions that produce differential results” (Messick 1975, p. 246).
Finally, a consideration of the kinds of evidence and interpretations to be provided decision makers leads inexorably to the realization that different kinds of evidence are needed, depending on the decision-maker’s problems and the availability of resources. The most scientific evidence involving objective data on student performance can be brilliantly interpreted by an evaluator, but it might also be an abomination to a decision maker who really needs to know whether teachers’ attitudes are favorable.
ETS evaluations have provided a great variety of evidence. For a formative evaluation in Brevard County, Florida, Trismen (1970) provided evidence that students could make intelligent choices about courses. In the ungraded schools, students had considerable freedom of choice, but they and their counselors needed considerably more information than in traditional schools about the ingredients for success in each of the available courses. As another example, Gary Echternacht, George Temp, and Theodore Stolie helped state and local education authorities develop Title I reporting models that included evidence on impact, cost, and compliance with federal regulations (Echternacht et al. 1976). Forehand and McDonald (1972) had been working with New York City to develop an accountability model providing constructive kinds of evidence for the city’s school system. On the other hand, as part of an evaluation team, Amarel provided, for a small experimental school in Chicago, judgmental data as well as reports and documents based on the school’s own records and files (Amarel and The Evaluation Collective 1979). Finally, Michael Rosenfeld provided Montgomery Township, New Jersey, with student, teacher, and parent perceptions in his evaluation of the open classroom approach then being tried out (Rosenfeld 1973).
In short, just as tests are not valid or invalid (it is the ways tests are used that deserve such descriptions), so too, evidence is not good or bad until it is seen in relation to the purpose for which it is to be used, and in relation to its utility to decision-makers.
For the most part, ETS’s involvement in program evaluation has been at the practical level. Without an accompanying concern for the theoretical and professional issues, however, practical involvement would be irresponsible. ETS staff members have therefore seen the need to integrate and systematize knowledge about program evaluation. Thus, Anderson obtained a contract with the Office of Naval Research to draw together the accumulated knowledge of professionals from inside and outside ETS on the topic of program evaluation. A number of products followed. These products included a survey of practices in program evaluation (Ball and Anderson 1975a), and a codification of program evaluation principles and issues (Ball and Anderson 1975b). Perhaps the most generally useful of the products is the aforementioned Encyclopedia of Educational Evaluation (Anderson et al. 1975).
From an uncoordinated, nonprescient beginning in the mid-1960s, ETS has acquired a great deal of experience in program evaluation. In one sense it remains uncoordinated because there is no specific “party line,” no dogma designed to ensure ritualized responses. It remains quite possible for different program evaluators at ETS to recommend differently designed evaluations for the same burgeoning or existing programs.
There is no sure knowledge where the profession of program evaluation is going. Perhaps, with zero-based budgeting, program evaluation will experience amazing growth over the next decade, growth that will dwarf its current status (which already dwarfs its status of a decade ago). Or perhaps there will be a revulsion against the use of social scientific techniques within the political, value-dominated arena of program development and justification. At ETS, the consensus is that continued growth is the more likely event. And with the staff’s variegated backgrounds and accumulating expertise, ETS hopes to continue making significant contributions to this emerging profession.