What Actually Buys Teacher Quality? Three Levers, Ranked.
TL;DR
- The US education debate spends most of its time arguing about pay-for-performance and across-the-board teacher raises, and almost no time on the thing the evidence most strongly supports: replacing the worst teachers.
- Pay-for-performance for incumbent teachers is a precisely estimated zero in every well-powered American experiment. Bonuses of up to $15,000 per teacher in Nashville for three years moved nothing. The same kind of bonus works in India, because in India the bottleneck is teacher effort. In the US the bottleneck is teacher quality, and money can't motivate quality into existence.
- Raising the ceiling on teacher pay, so the best teachers earn substantially more than the average ones (which is what Singapore does and the US doesn't), might be the most underrated lever in education policy. Nobody is talking about cutting bottom-quartile pay. The point is that the current flat schedule keeps the top of the profession uncompetitive with every other career path that high-aptitude college graduates now have access to.
- The most-cited fact in "education saves lives" advocacy, that an extra year of schooling sharply cuts adult mortality, turns out not to replicate. When researchers in the UK and Sweden re-ran the same identification strategy that produced the original finding, they got null results. Drop that one from your priors.
- The strongest equity claim in the entire literature is that one additional high school graduation reduces a Black man's incarceration probability by 3.4 percentage points, which is 4.5 times the effect for white men. The case for targeted K-12 spending on low-income, predominantly minority kids is unusually strong once you stop measuring it on test scores alone.
The wrong question
"Does more money for public schools work?" has been the most studied and least resolved question in American social science since the Coleman Report in 1966. You can defend any answer to it. Hanushek's correlational meta-analyses said no. Jackson, Johnson, and Persico's 2016 paper in the Quarterly Journal of Economics said yes: a 10 percent increase in K-12 spending for the full 12 years of school produces a 7.25 percent boost in adult earnings, with effects roughly 2.5 times larger for low-income kids. Both findings are correct about different things.
The right question is narrower. Which kind of spending, on which kids, against which outcome, on which time horizon? Run that decomposition and you stop debating "more money" and start ranking the actual policy levers by how well they work.
For teacher quality, which a generation of research has established as the single most important variable inside any school, there are three of them. They are not equally supported by the evidence.
Path A: Fire the worst teachers and replace them
This is the lever that works. It is also the lever almost nobody will use.
Washington DC implemented it under the IMPACT program starting in 2009. Teachers were evaluated on multiple measures (classroom observations, value-added on student test scores, administrative measures). Those rated "ineffective" were dismissed, as were those rated "minimally effective" two years in a row, while teachers rated "highly effective" were eligible for large bonuses. Thomas Dee and James Wyckoff used a regression discontinuity design, comparing teachers who fell just below a rating threshold with teachers who just cleared it.
The results, published in 2015:
- Teachers under dismissal threat left voluntarily 11 percentage points more often than they otherwise would have. The threat itself was the lever, because most low performers exited before the formal process required them to.
- Among the low performers who stayed, performance improved by 0.27 standard deviations.
- The high performers near the bonus threshold also improved, by 0.24 standard deviations.
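The threshold comparison behind those estimates is a regression discontinuity. A minimal sketch of the idea on synthetic data is below. Everything in it is an assumption made for illustration: the cutoff of 250, the bandwidth, the noise level, and the 0.27 jump planted in the fake outcomes. It is not IMPACT data and not the authors' estimation code.

```python
import random

def ols_line(xs, ys):
    """Return (intercept, slope) of a least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rd_estimate(scores, outcomes, cutoff, bandwidth):
    """Local linear RD: fit one line on each side of the cutoff, compare them at the cutoff."""
    # Center scores on the cutoff so each intercept is the fitted outcome at the threshold.
    below = [(s - cutoff, y) for s, y in zip(scores, outcomes)
             if cutoff - bandwidth <= s < cutoff]
    above = [(s - cutoff, y) for s, y in zip(scores, outcomes)
             if cutoff <= s <= cutoff + bandwidth]
    intercept_below, _ = ols_line(*zip(*below))
    intercept_above, _ = ols_line(*zip(*above))
    # Teachers just below the cutoff face the dismissal threat; those just above do not.
    return intercept_below - intercept_above

# Synthetic teachers (hypothetical numbers, for illustration only).
rng = random.Random(0)
CUTOFF = 250.0      # assumed rating threshold
TRUE_JUMP = 0.27    # effect planted in the fake data, in SD units
scores = [rng.uniform(200.0, 300.0) for _ in range(4000)]
outcomes = [
    0.01 * (s - CUTOFF)                      # smooth trend in next-year performance
    + (TRUE_JUMP if s < CUTOFF else 0.0)     # extra improvement under threat
    + rng.gauss(0.0, 0.2)                    # noise
    for s in scores
]

estimate = rd_estimate(scores, outcomes, CUTOFF, bandwidth=25.0)
print(f"estimated jump at the cutoff: {estimate:.2f} SD (planted: {TRUE_JUMP})")
```

The design works because teachers a point or two either side of the line are nearly identical in everything except the consequence attached to their rating, so a gap at the cutoff can be read as the effect of the threat itself.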
The downstream effect is what matters. A follow-up paper by Adnot, Dee, Katz, and Wyckoff in 2017 tracked what happened to students whose low-rated teachers got pushed out. Those students gained about 0.08 standard deviations per year in math, compared to students whose similarly-rated teachers were retained. The mechanism is close to arithmetic: when a below-average teacher leaves, the replacement is on average drawn from near the workforce mean, which is by construction higher than the rating of the teacher who left.
How big is this lever, exactly? Raj Chetty, John Friedman, and Jonah Rockoff worked it out in two landmark American Economic Review papers in 2014, using tax records and school data for more than a million children. A single year with a teacher one standard deviation better than average raises that student's adult earnings at age 28 by about 1 percent. Aggregated across a classroom of 28 students, the lifetime present-value impact of replacing a bottom-5-percent teacher with an average one is roughly $250,000 per classroom. Eric Hanushek's macro simulation in 2011 estimated that if every US district replaced its bottom 5 to 8 percent of teachers each year, the US would move from middling to near the top of international rankings.
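The replacement logic can be sketched in a few lines. This assumes teacher effectiveness follows a normal distribution, is measured without error, and that every replacement is exactly average. These are simplifying assumptions of this sketch, not the method Chetty and co-authors or Hanushek used, and measurement noise would shrink the real-world gain.

```python
from statistics import NormalDist

def mean_below_percentile(share):
    """Mean of a standard normal restricted to its bottom `share`, in SD units."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(share)          # cutoff below which `share` of teachers fall
    return -std_normal.pdf(z) / share      # E[Z | Z < z] for a standard normal

SHARE_REPLACED = 0.05

# How far below average the typical bottom-5-percent teacher sits.
bottom_mean = mean_below_percentile(SHARE_REPLACED)

# Swapping that group for average teachers (effectiveness 0) lifts the workforce mean by this much.
workforce_gain = -bottom_mean * SHARE_REPLACED

# The per-classroom figure quoted above, spread across 28 students.
per_student_value = 250_000 / 28

print(f"typical bottom-5% teacher: {bottom_mean:.2f} SD")
print(f"workforce mean gain:       {workforce_gain:.2f} SD")
print(f"implied value per student: ${per_student_value:,.0f}")
```

Under these assumptions the typical dismissed teacher sits about two standard deviations below the mean, which is why replacing a small share of the workforce moves the average by a noticeable amount.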
The political economy of doing this at scale is brutal. DC IMPACT required mayoral control of the schools, a high-profile chancellor willing to absorb the union battle (Michelle Rhee), and contract terms most districts can't negotiate. Newark, Houston, and New Haven all attempted IMPACT-style evaluation programs in the 2010s, and all of them walked the dismissal component back, partially or fully, as political costs accumulated. The lever works. The conditions to use it are rare.
Path B: Pay teachers for performance
This is what every reform-minded politician proposes. The American evidence is not ambiguous.
In 2007, New York City started a school-wide bonus program. About 200 schools were randomly assigned to receive the bonus; about 200 control schools were not. Treatment schools could earn up to $3,000 per teacher, paid to the school as a pool, based on student performance. Roland Fryer's 2013 analysis in the Journal of Labor Economics found no significant effect on math, none on reading, and the same null across every grade level and demographic subgroup. This was a precisely estimated zero: not a small, underpowered study failing to detect an effect, but a large, clean experiment finding nothing.
Nashville ran a more aggressive version starting in 2006. The Project on Incentives in Teaching (POINT) offered individual teacher bonuses of up to $15,000, which was about a third of starting salary in Nashville at the time, tied to individual student achievement gains. Three full years, then evaluated. Effect on math, year 1: roughly 0.01 standard deviations, not significant. Year 2: 0.02, not significant. Year 3: zero, not significant. Reading scores moved nowhere across all three years.
The Nashville result is what ruled out the easy explanations for the NYC null. The bonus was much larger. It was paid to individual teachers, not pooled to schools. It ran long enough to test sustained effects. None of that mattered for outcomes.
A 2021 meta-analysis in the Journal of Public Economics pooled the US evidence and found the same pattern. Pay-for-performance produces small positive effects only when bundled with dismissal threat. When you isolate the incentive lever from the selection lever, it does almost nothing.
The international comparison is the most useful one. Karthik Muralidharan and Venkatesh Sundararaman ran an enormous experiment across rural government schools in Andhra Pradesh, India, and published the results in 2011. Individual bonuses worth just 3 percent of pay, much smaller than POINT, produced large effects: +0.28 standard deviations in math, +0.16 in language by year two. Group bonuses worked nearly as well.
The reason it works in India is that teacher absenteeism in rural Indian public schools runs 20 to 30 percent. The binding constraint is effort, meaning getting teachers to show up and teach. Bonuses tied to outcomes pull on that lever directly. In the United States, teachers show up. They prepare. They try. The thing that varies enormously is teacher quality: innate skill, content knowledge, classroom management ability. That isn't something a bonus can produce out of an underprepared teacher. Pay-for-performance works on effort, and effort isn't the bottleneck here.
The one US exception is interesting. Fryer, Steven Levitt, John List, and Sally Sadoff ran a small experiment in Chicago Heights, Illinois, reported in 2012, with the same kind of bonus, except they paid the teachers up front, at the start of the year. If students didn't hit the target, the teachers had to give the money back. Loss aversion did what gain framing couldn't, producing +0.20 standard deviations in math. One district, one study, never replicated. If you want pay-for-performance to work in the US, you probably have to exploit loss aversion, a System 1 reflex, rather than appeal to rational incentives.

Path C: Pay high-quality teachers a lot more than low-quality ones
This is the lever the literature gestures at but cannot cleanly evaluate, because almost no US district has tried it.
Caroline Hoxby and Andrew Leigh published a study in 2004 documenting one of the more dramatic occupational shifts in American history. Between 1963 and 2000, the share of top-5-percent-aptitude female college graduates who became teachers fell from 20 percent to 4 percent. The share of the teaching workforce drawn from the top 5 percent of aptitude fell from 5 percent to 1 percent.
Their identification strategy used variation in state union-enabling laws as an instrument for how flat the teacher pay scale became, meaning how little a top-quartile teacher earned compared to a bottom-quartile one in the same grade. Their decomposition: about 80 percent of the top-aptitude exit was driven by that flattening of the pay scale, and only about 9 percent by women gaining wage parity in other professions.
If true, that is an enormous policy claim. The US salary schedule for teachers is famously flat: a top-quartile teacher earns about 1.3 times what a bottom-quartile teacher does in the same grade. Singapore, which recruits its teachers from the top 30 percent of college graduates, pays its best teachers something more like 2 to 3 times what its weakest are paid, and the ceiling rises aggressively over a career. Their salary structure is what you'd design if you actually wanted to attract top talent.
The American version doesn't require cutting anyone's pay. A back-of-envelope cost-benefit estimate, using the Hoxby-Leigh elasticity directly, suggests raising mean teacher pay 10 percent while restructuring so top teachers can earn 50 percent above the median (and median teachers keep their current pay) would produce a benefit-cost ratio of around 4.5 to 1 over 50 years. That puts it in the same league as Perry Preschool and well above across-the-board raises that preserve the flat structure.
Where the case gets weaker
The Hoxby-Leigh framing oversimplifies. Their instrumental variable identifies pay-scale flattening cleanly. It does not identify the effect of access barriers falling in other professions. Title IX passing in 1972, medical schools dropping explicit gender quotas, the American Bar Association abandoning its custom against women partners, Harvard Business School first admitting women in 1963: none of these changes shows up well in a model that proxies "outside options" with realized wages.
When you go look at where the missing 16 percentage points of top-aptitude female college graduates actually ended up, the story shifts:

| Destination (top-quintile female college grads) | 1963 | 2000 | Change |
|---|---|---|---|
| Teaching | ~20% | ~4% | −16 pp |
| Business / finance / management | ~3% | ~16% | +13 pp |
| Law / medicine / dentistry | ~1.5% | ~7% | +5.5 pp |
| Engineering / CS / hard sciences | <1% | ~3.5% | +3 pp |
| Research-university faculty | ~1% | ~3% | +2 pp |
| Newly-opened professions, summed | ~5-7% | ~28-32% | +~22 pp |
| Nursing / social work | ~15% | ~8% | −7 pp |
| Clerical / administrative | ~25% | ~10% | −15 pp |

The newly opened professions absorbed more (~22 pp) than teaching lost (16 pp). The extra inflow came out of clerical work and traditional female-coded jobs. At a macroeconomic level, Hsieh, Hurst, Jones, and Klenow estimated in 2019 that the removal of occupational barriers facing women and Black Americans accounts for 15 to 20 percent of total US economic growth per worker since 1960. That is not a small mechanism.
A more honest reconciliation: roughly 40 to 50 percent of the top-aptitude teaching exit was driven by access barriers falling, which is a labor market correcting a historical injustice rather than a problem to undo. Roughly 30 to 40 percent was the flat pay scale itself, the part that is actually policy-reversible. The rest is wage parity, the pill (which Claudia Goldin and Lawrence Katz showed raised the expected horizon of women's careers), and residual.
This trims the addressable wedge. The cost-benefit estimate falls from 4.5x to something more like 2 to 3x. Still positive, still defensible, but not a slam dunk.
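To see how much the addressable wedge shrinks, the two decompositions can be put in percentage points of the 16-point decline. The sketch below only multiplies the shares quoted above by that decline. The residual range is simply what is left over, and nothing here reproduces the cost-benefit model itself, which need not scale proportionally.

```python
TEACHING_DROP_PP = 16.0  # top-aptitude share entering teaching: ~20% (1963) to ~4% (2000)

def share_to_pp(low, high, total_pp=TEACHING_DROP_PP):
    """Convert a (low, high) share of the decline into percentage points."""
    return (round(low * total_pp, 2), round(high * total_pp, 2))

# Shares as quoted in the text; point estimates are entered as low == high.
hoxby_leigh = {
    "flat pay scale": share_to_pp(0.80, 0.80),
    "wage parity elsewhere": share_to_pp(0.09, 0.09),
}
reconciled = {
    "access barriers falling": share_to_pp(0.40, 0.50),
    "flat pay scale": share_to_pp(0.30, 0.40),
    # Leftover after the two ranges above: wage parity, the pill, residual.
    "everything else": share_to_pp(1 - 0.50 - 0.40, 1 - 0.40 - 0.30),
}

for label, table in (("Hoxby-Leigh", hoxby_leigh), ("Reconciled", reconciled)):
    for cause, (low, high) in table.items():
        print(f"{label:12s} {cause:26s} {low:5.2f} to {high:5.2f} pp")
```

On these numbers the policy-reversible part of the decline falls from about 12.8 points under the original attribution to somewhere between 4.8 and 6.4 points under the reconciliation.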
The honest framing isn't "recover the talent we lost." It is: stop running a compensation structure designed for an era when teaching could free-ride on the labor market's exclusion of high-aptitude women.

One claim you should drop from your priors
Adriana Lleras-Muney's 2005 paper in the Review of Economic Studies is the most-cited piece of evidence in the "schools save lives" argument. It estimated that one additional year of schooling reduces 10-year adult mortality by 3.6 percentage points, which is a 25 to 35 percent reduction in the death rate. That is an enormous effect, and the paper has been cited thousands of times in policy discussions.
It does not replicate.
Bhashkar Mazumder re-ran the analysis in 2008 with different cohort definitions and found no compelling causal effect. Damon Clark and Heather Royer ran the same identification strategy on UK compulsory schooling reforms and published the results in the American Economic Review in 2013: the first stage worked (education and wages both rose), but the health effect was zero. Costas Meghir and co-authors ran it on Swedish reforms in 2018 and found another null. A 2015 meta-analysis of 18 European reforms found a small positive effect for men and zero for women.
The original Lleras-Muney result was likely an artifact of specification choices and a weak first stage. The right effect size for high-income countries today is probably close to zero. If you've been citing "education saves lives" as part of the K-12 investment case, the replication evidence says you should stop.
The good news for the K-12 case is that it doesn't need that claim. The civic externality that does hold up is crime reduction, and it is concentrated in exactly the population that benefits most from earnings effects.
Lance Lochner and Enrico Moretti's 2004 paper estimated that a single additional high school graduation reduces a man's incarceration probability by 0.76 percentage points if he is white and by 3.4 percentage points if he is Black. This finding has been replicated in Sweden, the UK, and Canada. The crime-cost savings alone are 14 to 26 percent of the private wage return to graduation. Stack that on top of the earnings effects and the equity case for targeted K-12 spending is unusually strong.
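The arithmetic behind the equity claim is short enough to check directly. The sketch uses only the figures quoted above; treating the crime savings as a simple add-on to the private wage return is a simplifying assumption made here, not something the paper is quoted as doing.

```python
# Incarceration effect of one additional high school graduation, in percentage points.
white_men_pp = 0.76
black_men_pp = 3.4
ratio = black_men_pp / white_men_pp

# Crime-cost savings as a share of the private wage return to graduation.
crime_share_low, crime_share_high = 0.14, 0.26
combined_low = 1 + crime_share_low     # wage return plus crime savings, low end
combined_high = 1 + crime_share_high   # wage return plus crime savings, high end

print(f"effect for Black men relative to white men: {ratio:.2f}x")
print(f"wage return plus crime savings: {combined_low:.2f} to {combined_high:.2f} times the wage return alone")
```

The ratio comes out just under 4.5, which is the multiple cited in the TL;DR.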
What this means
For policy. Stop proposing pay-for-performance for incumbent teachers. The American evidence is conclusive that it doesn't work outside the IMPACT-style bundle. Spend the political capital on selection: multi-measure evaluation, principal authority to non-renew, dismissal of the bottom 5 percent annually. The math is on your side and the evidence is unusually clean.
For unions. The single-salary schedule was designed for a labor market that no longer exists. It made sense when teaching was one of the few white-collar professions open to high-aptitude women. In a labor market where every profession competes for the same top talent, paying high-performing teachers the same as low-performing ones is a structural commitment to lose your highest-aptitude members to other industries. Raising the ceiling doesn't require cutting the floor. It also doesn't require accepting dismissal-based selection alongside it.
For donors and funders. Money spent on Path B (pay-for-performance) is largely wasted. Money spent on Path A infrastructure (evaluation systems, principal training, replacement pipelines) compounds. Money spent on Path C experiments, meaning a state or large district actually raising the top of the teacher pay scale enough to attract competitive talent, is probably the highest-value research bet in the field. The elasticity that all the cost-benefit models hang on cannot be pinned down without one.
For journalists. "Schools don't matter" is wrong. "Pay teachers more and outcomes will improve" is wrong. "Fire the bad teachers" is the answer that fits the evidence best, fits no political coalition, and gets almost no coverage.
What would change this
A state or large district raising the top of the teacher pay scale across its whole workforce would be the cleanest test. Dallas's Teacher Excellence Initiative (TEI) is the closest existing American attempt. If a state implemented Singapore-style differentiated pay tomorrow, we'd have a clean answer on Path C by 2050.
A replication of the Chicago Heights loss-framing experiment in two or three more districts would shift Path B from "doesn't work" to "doesn't work in the form we keep testing."
Someone integrating crime, earnings, and behavioral health into a single cost-benefit study, the way Jackson-Johnson-Persico did for earnings alone, would settle the broader value-of-K-12 question. The fragments are in the literature. Nobody has assembled them.
A new design closing the residual concern about teacher value-added measurement in tracked classrooms (Jesse Rothstein flagged it in 2010, Chetty and co-authors largely answered it in 2014 but not completely) would make the Path A dismissal case airtight.
Data and sources
The Path C cost-benefit numbers come from a back-of-envelope model anchored on Hoxby-Leigh's published instrumental-variable elasticity. Model code is in the source folder for anyone who wants to push on the assumptions.
Key papers behind the claims:
- Jackson, Johnson, Persico (2016) Quarterly Journal of Economics 131(1): 7.25% adult wage gain per 10% K-12 spending, larger for low-income kids
- Chetty, Friedman, Rockoff (2014) American Economic Review 104(9), two papers: 1 SD better teacher → +1% adult earnings, $250K NPV per classroom
- Dee, Wyckoff (2015) Journal of Policy Analysis and Management 34(2): DC IMPACT regression discontinuity
- Adnot, Dee, Katz, Wyckoff (2017) Educational Evaluation and Policy Analysis 39(1): student outcomes after IMPACT teacher exits
- Hanushek (2011) Economics of Education Review 30(3): simulation of dismissal-policy GDP gains
- Fryer (2013) Journal of Labor Economics 31(2): NYC pay-for-performance null
- Springer et al. (2010) POINT Project, Vanderbilt: Nashville $15K bonus null
- Muralidharan, Sundararaman (2011) Journal of Political Economy 119(1): India individual incentive RCT
- Fryer, Levitt, List, Sadoff (2012) NBER WP 18237: Chicago Heights loss-framing
- Hoxby, Leigh (2004) American Economic Review (P&P) 94(2): pay scale flattening as IV for teacher aptitude decline
- Corcoran, Evans, Schwab (2002) NBER WP 9180: historical female teacher aptitude data
- Bacolod (2007) Review of Economics and Statistics 89(4): alternative opportunities and teacher quality
- Hsieh, Hurst, Jones, Klenow (2019) Econometrica 87(5): barrier removal as 15 to 20 percent of GDP growth
- Lleras-Muney (2005) Review of Economic Studies 72(1): the mortality finding that doesn't replicate
- Clark, Royer (2013) American Economic Review 103(6): UK null on mortality
- Meghir, Palme, Simeonova (2018) AEJ: Applied 10(2): Sweden null on mortality
- Lochner, Moretti (2004) American Economic Review 94(1): incarceration effects, larger for Black men
- Heckman et al. (2010) Journal of Public Economics 94(1-2): Perry Preschool corrected IRR
- Garcia, Heckman, Leaf, Prados (2020) Journal of Political Economy 128(7): Abecedarian IRR
Errors are mine.