// ai postmortems · by Josh · May 2, 2026 · 5 min read

Why We Ripped Out the Embeddings Layer 8 Weeks In

RAG looked like the answer. The problem was a different shape than I thought. Here's what we ripped out, what we replaced it with, and the deciding test.

Eight weeks into a knowledge-management project, I called the client and said "we're rebuilding the core."

We ripped out the embedding-based RAG layer and replaced it with structured search. The system got faster, cheaper, and more accurate.

This is the story of when RAG is the wrong tool and what to do instead.

What we built first

The client had ~12,000 internal documents — policies, procedures, internal memos, FAQs, training materials. They wanted a "search and answer" interface for staff.

I built the obvious thing. Embed every document. Store embeddings in a vector DB. On query, embed the question, find top-K similar documents, send them as context to an LLM, return the synthesized answer.
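A minimal sketch of that first pipeline, with illustrative stand-ins rather than the client's actual stack: `sentence-transformers` plays the embedding model, an in-memory matrix plays the vector DB, and `synthesize_with_llm` is a placeholder for the LLM call.

```python
# Illustrative sketch of the original embed -> retrieve -> synthesize pipeline.
# Model name and synthesize_with_llm() are stand-ins, not the real stack.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(documents: list[str]) -> np.ndarray:
    # Embed every document once; normalized vectors make dot product = cosine similarity.
    return embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, doc_vectors: np.ndarray, documents: list[str], k: int = 5) -> list[str]:
    # Embed the question, take the top-K most similar documents.
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def synthesize_with_llm(context: str, question: str) -> str:
    # Placeholder for the LLM API call that turned retrieved context into an answer.
    raise NotImplementedError

def answer(query: str, doc_vectors: np.ndarray, documents: list[str]) -> str:
    context = "\n\n".join(retrieve(query, doc_vectors, documents))
    return synthesize_with_llm(context=context, question=query)
```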

It worked. Demos were great. Then production showed three problems.

Why it broke

One, the documents had massive overlap. The same policy was restated in 4-7 documents because the firm had a documentation problem before they had a search problem. Embeddings retrieved 5 similar documents, the LLM had to choose, and it would sometimes choose the older or less canonical one.

Two, the most important documents had structured metadata (date last updated, owner, version, supersedes) that the embedding approach ignored. The right answer was almost always the most recent version of the most canonical document. Embeddings can't reliably rank by "most recent" or "most canonical."

Three, staff queries fell into a small number of buckets (PTO policy, expense reimbursement, IT access, etc.). Most queries weren't fishing through 12,000 documents. They were asking one of 200 specific questions.

The embeddings layer was treating every query as an open-ended search problem. Most queries were a lookup with a known, canonical answer.

What we replaced it with

Two layers.

Layer one: structured FAQ. The top 200 questions got explicit answers, version-tagged and owner-tagged. When a user asks something that matches a known FAQ (semantic match), the system returns the canonical answer immediately.
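Here's roughly what that matching layer can look like. The entry fields and the 0.80 similarity threshold are illustrative assumptions, not the client's exact setup.

```python
# Layer one sketch: semantic match against ~200 canonical FAQ entries.
# Entry fields and the 0.80 threshold are illustrative assumptions.
import numpy as np
from dataclasses import dataclass
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

@dataclass
class FaqEntry:
    question: str
    answer: str    # canonical, author-verified text
    owner: str     # team responsible for keeping it current
    version: str

def build_faq_index(entries: list[FaqEntry]) -> np.ndarray:
    # Embed the canonical questions once; incoming queries are matched against these.
    return embedder.encode([e.question for e in entries], normalize_embeddings=True)

def faq_answer(query: str, entries: list[FaqEntry], faq_vectors: np.ndarray,
               threshold: float = 0.80) -> str | None:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = faq_vectors @ q
    best = int(np.argmax(scores))
    if scores[best] >= threshold:
        return entries[best].answer   # canonical answer, returned verbatim
    return None                       # fall through to layer two
```

Because a hit returns the canonical text verbatim, there's no LLM synthesis step to pay for or second-guess.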

Layer two: filtered semantic search. For the long tail of questions, the system still uses embeddings, but with filters applied: only the latest version of each document, only documents owned by the relevant team, only documents with "production" status.
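A sketch of the filtered layer, assuming each document carries the metadata fields described above (field names here are illustrative).

```python
# Layer two sketch: apply metadata filters first, then do semantic search
# over only the surviving documents. Field names are illustrative.
import numpy as np
from dataclasses import dataclass
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

@dataclass
class Doc:
    text: str
    owner: str                   # owning team
    status: str                  # e.g. "production", "draft", "archived"
    superseded_by: str | None    # None means this is the latest version

def eligible(doc: Doc, team: str) -> bool:
    # Only the latest, production-status documents owned by the relevant team.
    return doc.status == "production" and doc.superseded_by is None and doc.owner == team

def filtered_search(query: str, docs: list[Doc], team: str, k: int = 5) -> list[Doc]:
    candidates = [d for d in docs if eligible(d, team)]
    if not candidates:
        return []
    vectors = embedder.encode([d.text for d in candidates], normalize_embeddings=True)
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ q
    return [candidates[i] for i in np.argsort(scores)[::-1][:k]]
```

In production you'd push those filters into the vector store rather than re-embedding candidates per query; the sketch only shows the ordering that mattered: filter on metadata first, rank by similarity second.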

The combined system answers 85% of queries from the FAQ layer (instant, canonical, cheap) and the rest from filtered semantic search (slower but accurate).

The decision test

I would have known to do this from the start if I'd asked one question:

"What are the top 50 things staff need to know? Can someone write the canonical answer in a paragraph each?"

If the answer is yes (the firm has well-known repeated questions), you don't need RAG. You need a smart FAQ.

If the answer is no (every query is unique and exploration-shaped), you need RAG.

Most firms are closer to "yes." Most consultants (including me, for a while) default to "no" because RAG is the technology they're excited to build.

What I do now

For every knowledge-management engagement, I now run a "top 50 questions" exercise in the first week. Stakeholders write the questions. SMEs write the canonical answers.

If we can fill the list with 50 specific questions that staff actually ask, we build an FAQ system with semantic matching. RAG comes second for the long tail.

If we can't fill the list, that's a signal the firm doesn't have a knowledge problem — they have a knowledge-creation problem. The fix is to create the canonical answers, not to retrieve from chaos.

What changed at the client

Query response time dropped from 8-12 seconds to 1-2 seconds for the 85% covered by FAQ. Anthropic API spend dropped 60% because the FAQ layer doesn't need LLM synthesis. Accuracy improved because the canonical answers are author-verified.

The system is also easier to maintain. When a policy changes, the FAQ entry gets updated by the owner. The change shows up immediately. With RAG, you'd have to re-embed the changed document and hope retrieval surfaces it instead of the stale copies.

The lesson

RAG is the right tool when:
- Every query is unique
- The answer requires synthesis across multiple sources
- The corpus is too large for explicit Q&A

RAG is the wrong tool when:
- The top 50 queries cover 80%+ of volume
- There's a canonical answer for each
- The corpus has overlap, versioning, or hierarchy

The wrong tool is fine for demos. The wrong tool is expensive in production.

What I'd build first

The "top 50 questions" exercise. Half a day. Costs nothing. Tells you whether you need RAG or something simpler.

The exercise is the diagnostic. The architecture is the prescription. Don't prescribe before diagnosing.

postmortem · rag · embeddings · ai architecture · failure