More and more studies challenging test-based corporate 'school reform' claims — and 'value-added' attacks on teachers

As the March on Washington continues to gain momentum, one important story has been almost obscured in the glut of instant news and media nonsense (more than 200 reporters and photographers were covering the resignation of Congressman Anthony Weiner on June 16, 2011). The entire structure of test-based 'accountability' systems, central both to corporate school reform and to the policies of the U.S. Department of Education under Arne Duncan, is collapsing under the weight of critical studies that come from across the political spectrum. Critics from the right (even the American Enterprise Institute) are releasing studies critical of the current fads, and critics from the left (like the teacher unions) are proving the same thing. Yet the U.S. Department of Education and state school leaders are still pursuing a program that most evidence now shows has failed.

So here is a potpourri of recent articles.

The first is from blogger Dana Goldstein:


Although much of the Obama administration's education reform agenda promotes test score-based teacher evaluation and pay, the tide seems to be turning significantly against such policies, at least among wonks and academics. Last week the National Academies of Science published a synthesis of 10 years' worth of research on 15 American test-based incentive programs, finding they demonstrated few good results and a lot of negative unintended consequences.

Meanwhile, the National Center on Education and the Economy reported that high-achieving nations have focused on reforming their teacher education and professional development pipelines, not on efforts to measure student "growth" and tie such numbers to individual teachers.

Today, a paper coauthored by the Asia Society and the Department of Education itself calls Singapore a model for teacher evaluation. That nation's teachers are assessed on four "holistic" qualities, including the "character development of their students" and "their relationship to community organizations and to parents." There is no attempt to create a mathematical formula to tie student test scores to teacher evaluation or pay.

Lastly, even the free-market American Enterprise Institute has a new paper, by Fairfax County, Virginia Superintendent Jack Dale, arguing that the path forward should be differentiated pay based on teams of teachers taking on additional mentoring, curriculum development, and planning responsibilities. Test-based merit pay plans "miss a crucial point: teaching must be a collaborative team effort, and incentivizing individual teachers will not accomplish our ambitious goal," Dale writes.

Yes, there's a lot there to digest. The good news is, there are also some exciting policy alternatives.

After The American Prospect published "The Test Generation," my feature story about different models for teacher evaluation in Colorado, a number of readers challenged my suggestion that policy makers have more to learn from Denver's Math and Science Leadership Academy, which practices teacher peer-review, than from Harrison District 2 in Colorado Springs, which runs a merit pay program tied to student test scores. MSLA, they said, is a small school in which it's easy to build trust among peers. It can practice extreme discretion in hiring, so it's less likely there will be bad teachers to weed out later on. All that is true in the case of MSLA, although we also know peer-review has worked in some large American school districts, most notably Columbus and Toledo, Ohio, both of which weeded out a significant number of poor-performing teachers using such systems. Now the New York Times' Michael Winerip profiles PAR, the teacher peer-review plan in Montgomery County, Maryland, which has fired 200 poor-performing teachers and encouraged another 300 to quit since its inception 11 years ago.

Unfortunately, federal dollars from the Obama administration’s Race to the Top program are not going where Dr. Weast and the PAR program need to go. Montgomery County schools were entitled to $12 million from Race to the Top, but Dr. Weast said he would not take the money because the grant required districts to include students’ state test results as a measure of teacher quality. “We don’t believe the tests are reliable,” he said. “You don’t want to turn your system into a test factory.”

Weast, Montgomery's superintendent, is a visionary guy who speaks frequently about the need to build relationships of trust between communities, school administrators, and teachers--and actually follows up on the rhetoric with great policy-making. I'll give him the last word, from an April interview with the Washington Post:

You have close relations with labor.

"I have close relations with people who work in the school business. They happen to be unionized, and I find that good, because it’s easier to actually visit with them because they have an organized structure. We have 22,000 employees. It’s just hard to have a sit-down conversation with all 22,000 of them."

Is there a downside to working with unions?



Panel Finds Few Learning Gains From Testing Movement, By Sarah D. Sparks

Nearly a decade of America’s test-based accountability systems, from “adequate yearly progress” to high school exit exams, has shown little to no positive effect overall on learning and insufficient safeguards against gaming the system, a blue-ribbon committee of the National Academies of Science concludes in a new report.

“Too often it’s taken for granted that the test being used for the incentive is itself the marker of progress, and what we’re trying to say here is you need an independent assessment of progress,” said Michael Hout, the sociology chair at the University of California, Berkeley. He is the chairman of the 17-member committee, a veritable who’s who of national experts in education law, economics and social sciences that was launched in 2002 by the National Academies, a private, nonprofit quartet of institutions chartered by Congress to provide science, technology and health-policy advice.

During the last 10 years, the committee has been tracking the implementation and effectiveness of 15 test-based incentive programs, including:

• National school improvement programs under the No Child Left Behind Act and prior iterations of the Elementary and Secondary Education Act;

• Test-based teacher incentive-pay systems in Texas, Chicago, Nashville, Tenn., and elsewhere;

• High school exit exams adopted by about half of states;

• Pay-for-scores programs for students in New York City and Coshocton, Ohio; and

• Experiments in teacher incentive-pay in India and student and teacher test incentives in Israel and Kenya.

On the whole, the panel found the accountability programs often used assessments too narrow to accurately measure progress on program goals and used rewards or sanctions not directly tied to the people whose behavior the programs wanted to change. Moreover, the programs often had insufficient safeguards and monitoring to prevent students or staff from simply gaming the system to produce high test scores disconnected from the learning the tests were meant to inspire.

“I think there are some real messages for school districts on accountability systems” in the report, said Kevin Lang, an economics professor at Boston University who, during his time on the committee, also served as a district school board member in Brookline, Mass.

“School boards need to have a means for monitoring the progress of their school systems, and they tend to do it by looking at test scores,” he said. “It’s not that there’s no information in the objective performance measures, but they are imperfect, and including the subjective performance measures is also very important. Incentives can be powerful, but not necessarily in the way you would like them to be powerful.”

Gaming the System

Among the most common problems the report identifies is that most test-based accountability programs use the same test to apply sanctions and rewards as to evaluate objectively whether the system works. As a result, staff and students facing accountability sanctions tend to focus on behavior that improves the test score alone, such as teaching test-taking strategies or drilling students who are closest to meeting the proficiency cut-score, rather than improving the overall learning that the test score is expected to measure. This undercuts the validity of the test itself.

Committee on Incentives and Test-Based Accountability

• Michael Hout (Chair)*: Sociology Chairman (University of California; Berkeley)

• Dan Ariely: Professor of Psychology and Behavioral Economics (Duke University; Durham, N.C.)

• George P. Baker III: Professor of Business Administration (Harvard Business School; Boston)

• Henry Braun: Professor of Education and Public Policy; Director of the Center for the Study of Testing, Evaluation, and Educational Policy (Boston College; Chestnut Hill, Mass.)

• Anthony S. Bryk (until 2008): (Carnegie Foundation for the Advancement of Teaching; Stanford, Calif.)

• Edward L. Deci: Professor of Psychology and Social Sciences; Director of the Human Motivation Program (University of Rochester; Rochester, N.Y.)

• Christopher Edley Jr.: Professor and Dean of Law (University of California; Berkeley)

• Geno J. Flores: Former Chief Deputy, Superintendent of Public Instruction (California Department of Education)

• Carolyn J. Heinrich: Professor and Director of Public Affairs; Affiliated Professor of Economics (University of Wisconsin-Madison)

• Paul T. Hill: Research Professor; Director of the Center on Reinventing Public Education (University of Washington Bothell)

• Thomas J. Kane**: Professor of Education and Economics; Director of the Center for Education Policy Research (Harvard University; Cambridge, Mass.)

• Daniel M. Koretz: Professor of Education (Harvard University; Cambridge, Mass.)

• Kevin Lang: Professor of Economics (Boston University; Boston)

• Susanna Loeb: Professor of Education (Stanford University; Stanford, Calif.)

• Michael Lovaglia: Professor of Sociology; Director of the Center for the Study of Group Processes (University of Iowa; Iowa City)

• Lorrie A. Shepard: Dean and Professor of Education (University of Colorado at Boulder)

• Brian M. Stecher: Associate Director for Education (Rand Corp.; Santa Monica, Calif.)

* Member, National Academy of Sciences

** Was not able to participate in the final committee deliberations due to a scheduling conflict.

SOURCE: National Academies

For example, New York’s requirement that all high school seniors pass the Regents exam before graduating high school led to more students passing the Regents tests, but scores on the lower-stakes National Assessment of Educational Progress, which was testing the same subjects, didn’t budge during the same time period, the report found.

“It’s human nature: Give me a number, I’ll hit it,” Mr. Hout said. “Consequently, something that was a really good indicator before there were incentives on it, be it test scores or the stock price, becomes useless because people are messing with it.”

In fact, the report found that, rather than leading to higher academic achievement, high school exit exams so far have decreased high school graduation rates nationwide by an average of about 2 percentage points.

The study found a growing body of evidence of schools and districts tinkering with how and when students took the test to boost scores on paper for students who did not know the material—or to prevent those students from taking the test at all.

Recent changes to federal requirements for reporting graduation rates, which require that schools count as dropouts students who “transfer” to a school that does not award diplomas, may help safeguard against schools pushing out students to improve test scores or graduation rates. Still, the National Academies researchers warned that state and federal officials do not provide enough outside monitoring and evaluations to ensure the programs work as intended.

AYP and Academics

For similar reasons, school-based accountability mechanisms under NCLB have generated minimal improvement in academic learning, the study found. When the systems are evaluated—not using the high-stakes tests subject to inflation, but using instead outside comparison tests, such as the NAEP—student achievement gains dwindle to about .08 of a standard deviation on average, mostly clustered in elementary-grade mathematics.

To give some perspective, an intervention considered to have a small effect size is usually about .1 standard deviations; a 2010 federal study of reading-comprehension programs found a moderately successful program had an effect size of .22 standard deviations.
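
For readers who want a feel for what these effect sizes mean, the short sketch below converts a gain measured in standard deviations into the percentile a median student would reach. It is only an illustration, not part of the National Academies report: it assumes test scores follow a normal distribution, and the function name `percentile_after_shift` is made up for this example.

```python
import math


def percentile_after_shift(effect_size_sd: float) -> float:
    """Percentile (0-100) reached by a student who starts at the median and
    gains `effect_size_sd` standard deviations.

    Assumption (not from the report): scores are normally distributed, so the
    answer is the standard normal CDF evaluated at the effect size.
    """
    return 100 * 0.5 * (1 + math.erf(effect_size_sd / math.sqrt(2)))


# Effect sizes quoted in the article; the labels are shorthand for this example.
quoted_effects = [
    ("test-based accountability, checked against outside tests", 0.08),
    ("conventional 'small' effect", 0.10),
    ("moderately successful reading program", 0.22),
]

for label, effect in quoted_effects:
    print(f"{label}: {effect:.2f} SD -> percentile {percentile_after_shift(effect):.1f}")
```

Under that normality assumption, a gain of .08 standard deviations would move a student from the 50th percentile to roughly the 53rd, while the .22 figure corresponds to roughly the 59th.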

Moreover, “as disappointing as a .08 standard deviation might be, that’s bigger than any effect we saw for incentives on individual students,” Mr. Hout said, noting that NCLB accountability measures school performance, not that of individual students.

Committee members see some hopeful signs in the 2008 federal requirement that NAEP scores be used as an outside check on achievement results reported by districts and states, as well as the broader political push to incorporate more diverse measures of student achievement in the next iteration of ESEA.

“We need to look seriously at the costs and benefits of these programs,” said Daniel M. Koretz, a committee member and an education professor at Harvard University Graduate School of Education in Cambridge, Mass. “We have put a lot into these programs over a period of many years, and the positive effects when we can find them have been pretty disappointing.”

Jon Baron, the president of the Washington-based Coalition for Evidence-Based Policy and the chairman of the National Board for Education Sciences, which advises the Education Department’s research arm, said he was impressed by the quality of the committee’s research review but unsurprised at minimal results for the various incentive programs.

Incorporating diverse types of studies typically reduces the overall effects found for them, he noted, adding that the study also addresses a broader issue. “One of the contributions that this makes is that it shows that looking across all these different studies with different methodologies and populations, some in different countries, there are very minimal effects in many cases and in a few cases larger effects. It makes the argument that details matter,” Mr. Baron said.

“It’s an antidote to what has been the accepted wisdom in this country, the belief that performance-based accountability and incentive systems are the answer to improving education,” Mr. Baron said. “That was basically accepted without evidence or support in NCLB and other government and private sector efforts to increase performance.”

Vol. 30, Issue 33

