// ai postmortems · by Josh · April 29, 2026 · 4 min read

The Model Swap That Doubled Cost and Dropped Quality

A 'cheaper, faster' model upgrade went the wrong way on every metric. Here's how the trap works and the test that catches it before it costs you a quarter.


Someone smarter than me figured out we could save 30% on AI costs by switching from Sonnet to Haiku for our classification tasks. The proposal made sense. Haiku is cheaper per token. Classification is a "simple" task. Let's switch.

Three weeks later the bill was higher and the classifier was worse.

This is how the trap works.

What we measured wrong

The proposal compared per-token cost. Haiku was about 1/8 the cost of Sonnet at the time. We modeled the savings.

The model didn't account for:

  • Haiku needing more tokens per classification, because we had to prompt-engineer harder
  • Haiku's higher error rate causing retries and downstream re-work
  • Haiku occasionally requiring a "second pass" through Sonnet for ambiguous cases
  • The long tail of weird inputs in production traffic that the simpler model couldn't handle
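In hindsight, the comparison we should have run looks less like a per-token price sheet and more like an expected-cost formula. Here's a minimal sketch; every constant in it is a hypothetical placeholder, not our actual rate:

```python
# Expected API cost per classification, including the paths the proposal ignored.
# All constants below are hypothetical placeholders.

SONNET_COST = 0.0100   # $ per classification, Sonnet-only baseline
HAIKU_COST = 0.00125   # $ per classification at ~1/8 the per-token price

TOKEN_INFLATION = 1.5  # heavier prompting means more tokens per Haiku call
RETRY_RATE = 0.14      # fraction of Haiku results that fall back to Sonnet

def haiku_path_cost() -> float:
    """Expected API cost per classification on the Haiku-first path."""
    first_pass = HAIKU_COST * TOKEN_INFLATION
    second_pass = RETRY_RATE * SONNET_COST  # the fallback pays for Sonnet too
    return first_pass + second_pass

print(f"Sonnet-only:      ${SONNET_COST:.4f} per classification")
print(f"Haiku + fallback: ${haiku_path_cost():.4f} per classification")
```

TOKEN_INFLATION and RETRY_RATE are the two terms the proposal silently set to 1.0 and 0. Neither is knowable from a pricing page; both have to be measured. And this is API cost only — the human review queue, covered below, is what actually broke the math for us.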

What actually happened

Week one: classifications kept flowing, but accuracy dropped from 96% to 91%. We told ourselves 91% was acceptable.

Week two: downstream systems started flagging more "low confidence" classifications. Each flag triggered a Sonnet retry (we'd built this fallback). The retry path was getting hit 14% of the time instead of the expected 3%.

Week three: the per-classification cost was now higher than Sonnet-only would have been, because every classification had a chance of paying for both Haiku AND Sonnet runs.

We rolled back. Total cost of the experiment: roughly $1,800 in API spend + 22 hours of engineering time to diagnose and revert.

Root cause

The model-swap math assumed similar quality at lower cost. The reality was lower quality at lower nominal cost, and the quality drop cascaded into retries that swamped the savings.

Specifically: Sonnet's 96% accuracy meant a 4% manual review queue. Haiku's 91% accuracy meant a 9% manual review queue. Manual review cost us about $2/case in human time. The "savings" on the LLM disappeared when the review queue more than doubled.

We measured nominal API cost. We didn't measure total cost of classification including downstream effects.
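The total-cost math is short enough to write out. The review rates and the $2/case figure are from our incident; the API costs are hypothetical placeholders:

```python
# Total cost per classification = API cost + expected human-review cost.

REVIEW_COST = 2.00  # $ per manually reviewed case

def total_cost(api_cost: float, review_rate: float) -> float:
    """Cost per classification once the review queue is priced in."""
    return api_cost + review_rate * REVIEW_COST

sonnet_total = total_cost(api_cost=0.0100, review_rate=0.04)  # $0.0900
haiku_total = total_cost(api_cost=0.0033, review_rate=0.09)   # $0.1833

print(f"Sonnet-only: ${sonnet_total:.4f} per classification")
print(f"Haiku-first: ${haiku_total:.4f} per classification")
```

The five-point jump in review rate adds $0.10 per case on its own — an order of magnitude more than the entire nominal API cost of either model.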

What I do now

Before any model swap, I run a parallel-eval for one week. Both models run on the same inputs. We compare:

  • Accuracy on a labeled test set
  • Disagreement rate between models (a signal for ambiguous inputs)
  • Confidence-score calibration
  • Token consumption per task (sometimes the "cheap" model uses more)
  • Downstream effects (retry rate, manual review rate)
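A minimal sketch of the harness, assuming a generic classify() adapter that returns a label and a token count — wire in your own client where the stub is:

```python
from collections import Counter

def classify(model: str, text: str) -> tuple[str, int]:
    """Hypothetical adapter around your LLM client: (label, tokens_used)."""
    raise NotImplementedError("wire in your own API client here")

def parallel_eval(samples: list[tuple[str, str]]) -> dict:
    """Run incumbent (a) and candidate (b) on the same labeled inputs."""
    stats = Counter()
    for text, gold in samples:
        a_label, a_tokens = classify("incumbent", text)
        b_label, b_tokens = classify("candidate", text)
        stats["a_correct"] += a_label == gold
        stats["b_correct"] += b_label == gold
        stats["disagree"] += a_label != b_label  # flags ambiguous inputs
        stats["a_tokens"] += a_tokens
        stats["b_tokens"] += b_tokens
    n = len(samples)
    return {
        "a_accuracy": stats["a_correct"] / n,
        "b_accuracy": stats["b_correct"] / n,
        "disagreement_rate": stats["disagree"] / n,
        "a_tokens_per_task": stats["a_tokens"] / n,
        "b_tokens_per_task": stats["b_tokens"] / n,
    }
```

The disagreement set is worth reading by hand — it's where the ambiguous long tail shows up first.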

The parallel-eval costs roughly 2x for one week. It catches problems that would cost 10x in production.

If the eval shows quality matches within 1% on accuracy AND downstream metrics are stable, we swap. If not, we don't.
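That rule is small enough to encode as a gate. The 1-point accuracy bound is our rule from above; the 10% downstream tolerance is an illustrative choice, and retry_rate / review_rate assume you logged those alongside the parallel_eval output during the week:

```python
def should_swap(results: dict, baseline: dict) -> bool:
    """Swap only if accuracy matches within 1 point and downstream holds."""
    accuracy_ok = results["b_accuracy"] >= results["a_accuracy"] - 0.01
    downstream_ok = (
        results["retry_rate"] <= baseline["retry_rate"] * 1.10
        and results["review_rate"] <= baseline["review_rate"] * 1.10
    )
    return accuracy_ok and downstream_ok
```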

The lesson

Per-token cost is a vanity metric. Total cost per outcome is the real metric.

A cheaper model that's worse can be more expensive in total. A more expensive model that's better can be cheaper in total. The only way to know is to measure total cost, including the cascading downstream effects.

For most "save money on AI" proposals, the savings are real for 80% of tasks and disastrous for 20%. The 20% is where the math breaks. Without a parallel-eval you'll find the 20% in production.

What changes at scale

This trap gets worse at scale. At 10,000 classifications a day, a 5% quality drop is 500 more manual reviews a day. At 100,000, it's 5,000.

The bigger your volume, the more carefully you need to measure before swapping. The temptation to save 30% goes up. The pain of getting it wrong goes up.

I now require parallel-eval before any model swap on any production system. The discipline is worth it.

The exception

If you're prototyping or pre-production, swap freely. The cost is low. The lessons are cheap.

The rule applies only when you have real users depending on the output. Production is where you measure before you change.

postmortem · llm · model selection · cost · ai engineering