Honest Review: Claude in Production After 3 Months
I've shipped 4 production AI features on Claude over the last quarter. Here's what works, what surprised me, what cost me money, and where I'd reach for a different model.
Claude has been my default for production AI work for the last 14 months. Three months ago I started shipping features that depend on Claude at meaningful scale. Here's the honest take.
What works
Long context. Claude's effective long-context handling is the best in the category. Multi-document analysis, codebase-spanning refactors, long client conversations — Claude holds the thread better than alternatives.
Following negative instructions. When I tell Claude "don't do X," it tends to actually not do X. With other models I find myself needing 3-4 reinforcements to suppress unwanted behaviors.
Code generation. Sonnet 4.6 and Opus 4.7 produce code that's noticeably less dated than ChatGPT or Gemini output: more current API usage, cleaner patterns.
Refusal calibration. Claude refuses appropriately when needed but rarely over-refuses. The "I cannot do that" rate is acceptable.
What surprised me
Token efficiency. Per-token pricing made me expect higher bills, but Claude often produces the same output in fewer tokens. My totals usually come in lower than equivalent ChatGPT bills for similar work.
Tool use reliability. When I give Claude tool definitions, it uses them more reliably than alternatives. Fewer "hallucinated tool calls" that don't match the actual tools.
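For context, here's the shape of a tool definition with the anthropic Python SDK. The get_order_status tool and its schema are made-up examples, and the model ID is a placeholder:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A hypothetical tool; the name/description/input_schema format is the SDK's standard one.
tools = [
    {
        "name": "get_order_status",
        "description": "Look up the current status of a customer order by ID.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "The order ID, e.g. 'ORD-1234'"},
            },
            "required": ["order_id"],
        },
    }
]

response = client.messages.create(
    model="claude-sonnet-4-5",  # substitute your preferred model ID
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Where is order ORD-1234?"}],
)

# If the model decided to call the tool, the response contains a tool_use block.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```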
Streaming quality. Streaming output feels more natural, with fewer of the choppy chunk boundaries other models sometimes produce.
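A minimal example of the streaming pattern, using the SDK's messages.stream helper (model ID and prompt are placeholders):

```python
import anthropic

client = anthropic.Anthropic()

# stream.text_stream yields text deltas as they arrive, so you can pipe
# them straight through to the UI.
with client.messages.stream(
    model="claude-sonnet-4-5",  # substitute your preferred model ID
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this ticket thread."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```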
What cost me money
The reasoning model trap. Opus is dramatically more expensive than Sonnet, and I sometimes used Opus when Sonnet would have been fine. For many tasks the quality gain wasn't worth the price difference. I now default to Sonnet and only escalate to Opus when the task genuinely needs it.
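In code, that rule is a one-screen router. This is a sketch only; the task categories, the escalation heuristic, and the model IDs are placeholders to tune against your own evals:

```python
# "Default to Sonnet, escalate to Opus" as a routing function.
SONNET = "claude-sonnet-4-5"  # substitute your current Sonnet model ID
OPUS = "claude-opus-4-1"      # substitute your current Opus model ID

def pick_model(task_kind: str, needs_deep_reasoning: bool) -> str:
    # Escalate only for tasks that measurably benefit from Opus.
    if needs_deep_reasoning and task_kind in {"architecture_review", "hard_debugging"}:
        return OPUS
    return SONNET
```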
Long context overshoot. Sometimes I'd pass entire codebases when only a few files were relevant. The cost adds up. I now use embeddings + retrieval to limit context.
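A sketch of that retrieval step. Note the assumptions: embed() is a stand-in for whichever embedding provider you use (Anthropic doesn't offer an embeddings endpoint; Voyage or OpenAI embeddings are common choices), and whole-file embeddings are the crudest possible chunking:

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Stand-in: call your embedding provider here and return an (n, d) array.
    raise NotImplementedError("plug in a real embedding provider")

def top_k_files(query: str, files: dict[str, str], k: int = 5) -> list[str]:
    """Return the k file names most relevant to the query, by cosine similarity."""
    names = list(files)
    vecs = embed([files[n] for n in names])  # one vector per file
    q = embed([query])[0]
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    return [names[i] for i in np.argsort(sims)[::-1][:k]]
```

Send only those files in the prompt instead of the whole repo.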
No-cache calls. Anthropic's prompt caching bills cached prompt reads at roughly a tenth of the normal input-token rate, which on a big, stable system prompt is close to a 90% saving. I wasn't using it for the first month, and the bill was much higher than it needed to be. Use the cache.
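Enabling it is a small change per prompt: mark the stable prefix with cache_control. A minimal example with the Python SDK; the model ID and prompt text are placeholders:

```python
import anthropic

client = anthropic.Anthropic()
LONG_SYSTEM_PROMPT = "..."  # your big, stable system prompt goes here

response = client.messages.create(
    model="claude-sonnet-4-5",  # substitute your model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Everything up to and including this block becomes cacheable.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "First user message."}],
)

# usage.cache_read_input_tokens shows how much of the prompt was served
# from cache on this call, billed at the discounted read rate.
print(response.usage)
```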
Where I'd reach for something else
Very high-throughput simple tasks. For tasks like "classify this into one of five buckets" at 100k requests a day, Haiku is the right Claude. Sometimes GPT-4o-mini is even cheaper. Match the model to the task, as in the sketch below.
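A concrete version of that five-bucket classifier on Haiku; the bucket names and model ID are made-up placeholders:

```python
import anthropic

client = anthropic.Anthropic()
BUCKETS = ["billing", "bug", "feature_request", "spam", "other"]

def classify(text: str) -> str:
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # substitute your Haiku model ID
        max_tokens=10,
        system=(
            "Classify the message into exactly one of: "
            + ", ".join(BUCKETS)
            + ". Reply with the bucket name only."
        ),
        messages=[{"role": "user", "content": text}],
    )
    label = response.content[0].text.strip()
    return label if label in BUCKETS else "other"  # fail closed on odd output
```

At that volume, also look at the Message Batches API: batched requests run asynchronously at a 50% discount.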
Image generation. Claude doesn't generate images. Use Replicate, Midjourney, or Flux directly.
Hyper-recent information. Claude's training cutoff means it doesn't know about events from the last few months. With web search enabled, this is fine. Without it, it's a real limitation.
The thing nobody mentions
The Anthropic console for usage and billing is much better than OpenAI's. You can actually see which API key is spending what. You can set alerts. The observability matters when you're shipping things to production.
What changes in 6 months
The model is going to keep getting better. The pricing is going to keep coming down. The features (caching, batching, prompt management) are going to expand.
The right move is to bet on the API, not on a specific model version. Build your system to be model-agnostic at the boundary. Switch between Sonnet versions as new ones ship.
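Concretely, "model-agnostic at the boundary" can be as small as one module the rest of the app calls. A sketch, with the model ID as a placeholder you'd swap via config:

```python
# Only this module knows which vendor/model is behind complete();
# swapping model versions is a one-line config change.
import anthropic

_client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-5"  # swap via config/env when a new version ships

def complete(system: str, user: str, max_tokens: int = 1024) -> str:
    response = _client.messages.create(
        model=MODEL,
        max_tokens=max_tokens,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    return response.content[0].text
```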
What I'd tell someone starting
Default to Claude for new AI products. The combination of long context, instruction-following, and code quality is the strongest balance available right now.
Use Sonnet for most tasks. Use Opus only when the task genuinely needs more reasoning. Use Haiku for high-volume simple work.
Enable prompt caching from day one. Track per-prompt cost. Don't fly blind.
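Per-prompt cost tracking takes a dozen lines, because every response carries a usage block with exact token counts. The prices below are placeholders; look up current rates for your model:

```python
# Rough per-call cost from the usage block on every API response.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}  # USD per million tokens; verify

def call_cost(response) -> float:
    usage = response.usage
    return (
        usage.input_tokens * PRICE_PER_MTOK["input"]
        + usage.output_tokens * PRICE_PER_MTOK["output"]
    ) / 1_000_000
```

Log call_cost(response) alongside the prompt name and you can see exactly which feature is spending what.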
If you're already on OpenAI and it's working, don't switch reflexively. The differences are real but not so large that switching is automatic. Switch when a specific pain (long context, instruction-following) pushes you.