The Slack Thread That Saved a $250k Engagement
In week 11 of a major implementation, the system put the wrong client name on an outbound email. A junior staffer caught it before send. Here's what happened, what we changed, and what HITL actually means in production.
Week 11. A wealth management practice's outbound email queue had 47 client drafts ready for partner review. The drafts were generated by a Claude-powered system we'd built that turned meeting recaps into client-update emails.
The system was working. Open rates on the drafts were 64%. Reply rates were strong. Partners were saving 4-5 hours a week.
Then a junior staffer caught a draft that addressed a client as "Mr. Patel" when the client was actually "Mr. Pratt."
The original meeting notes had a typo. The system propagated the typo. The draft was professional, well-written, and addressed to the wrong person by name.
If that email had gone out, the engagement would have been over.
What happened
The meeting transcription system had auto-captured a name that the advisor had pronounced unclearly. The transcription tool wrote "Patel" instead of "Pratt." The advisor reviewed the transcription quickly and approved it. Nobody noticed the wrong name.
The Claude system then drafted a follow-up email using the transcription as ground truth. The email said "Mr. Patel, thank you for our conversation today about your daughter's college funding."
The junior staffer, doing routine review of the draft queue, recognized the client name because she'd been on the original call. She flagged it in our shared Slack channel. The draft was killed. We pulled every other draft from that week's meetings to recheck for similar issues.
We found two more. Different clients. Different but similar typos. Both would have shipped.
What we changed
Four things, in order of impact.
1. **Name verification step.** Every outbound draft now has a separate name-verification pass that cross-checks the recipient name against the CRM. The system flags any mismatch. This is dumb, brittle, and absolutely necessary. It catches 100% of typo propagation. (A sketch of the check follows this list.)
2. **Transcription review prompt.** The advisor now gets a specific prompt after each meeting: "Verify the client name(s) in this transcript." Just that. Three seconds. They have to acknowledge before the system can use the transcript.
3. **Junior reviewer assignment by relationship.** Drafts now route to the junior staffer who has the most calendar overlap with the meeting attendees. The person reviewing the Patel/Pratt email had been on the call. She recognized the error. A random reviewer would have missed it.
4. **Confidence thresholds on names.** Whisper (the transcription tool) returns confidence scores per word. Names with confidence below 0.85 now require human verification before the email draft can be generated (also sketched after the list). Slows down some drafts. Catches the ones that matter.
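For a sense of how simple item 1 really is, here's a minimal sketch, assuming a hypothetical in-memory stand-in for the CRM and a pared-down Draft object; the real pass hits the practice's CRM API and the actual draft schema.

```python
from dataclasses import dataclass

@dataclass
class Draft:
    client_id: str        # CRM record the meeting was logged against
    salutation_name: str  # surname the model used in the draft's greeting

# Hypothetical stand-in for the CRM; the real check calls the practice's CRM API.
CRM = {"C-1042": {"last_name": "Pratt"}}

def verify_recipient_name(draft: Draft) -> tuple[bool, str]:
    """Dumb, brittle, necessary: the surname in the draft must match the CRM exactly."""
    record = CRM.get(draft.client_id)
    if record is None:
        return False, f"no CRM record for {draft.client_id}"
    expected = record["last_name"].strip().casefold()
    used = draft.salutation_name.strip().casefold()
    if used != expected:
        # The Patel/Pratt case: the draft is blocked and routed to a human.
        return False, f"name mismatch: draft says {used!r}, CRM says {expected!r}"
    return True, "ok"

ok, reason = verify_recipient_name(Draft(client_id="C-1042", salutation_name="Patel"))
assert not ok  # mismatch is flagged before the email can be queued
```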
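And a sketch of the item-4 gate, assuming the transcription step exposes word-level (text, probability) pairs, which Whisper implementations can produce with word timestamps enabled; the capitalization check here is a crude stand-in for whatever name detection (NER, attendee-list match) the real system uses.

```python
NAME_CONFIDENCE_THRESHOLD = 0.85

def names_needing_verification(words):
    """Flag transcript tokens that look like proper names and came back low-confidence.

    `words` is a list of (text, probability) pairs from the transcription step's
    word-level output.
    """
    flagged = []
    for text, probability in words:
        token = text.strip(" ,.?!")
        looks_like_name = token[:1].isupper() and token.lower() not in {"i", "mr", "ms", "mrs", "dr"}
        if looks_like_name and probability < NAME_CONFIDENCE_THRESHOLD:
            flagged.append((token, probability))
    return flagged

# A mumbled "Pratt" transcribed as "Patel" tends to come back low-confidence;
# any flagged name blocks draft generation until a human verifies the transcript.
print(names_needing_verification([("Mr.", 0.99), ("Patel", 0.62), ("thanks", 0.97)]))
```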
What HITL actually means
Human-in-the-loop gets thrown around as a buzzword. In production it means one specific thing.
You design your AI system to make decisions. Most decisions ship without human touch. A specific subset of decisions, identified in advance as high-consequence, ship only after a human approves them.
The art is identifying which decisions need human touch. Get it wrong in one direction and you've automated yourself into liability. Get it wrong in the other direction and your humans are reviewing low-stakes work and resenting it.
For this practice, the high-consequence decisions ended up being:
- Outbound client emails (always reviewed)
- Allocation recommendations above $50k (always reviewed)
- Account-opening paperwork (always reviewed)
- Anything that names a specific client in writing (always verified)
Everything else (internal notes, meeting summaries, draft research) ships without human review. That's 80% of the system's output.
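One way to picture that split: the review rule is a small, boring function applied to every output before anything ships. Here's a sketch with hypothetical output kinds and field names, mirroring the list above.

```python
from dataclasses import dataclass

@dataclass
class Output:
    kind: str                   # "client_email", "allocation_rec", "account_paperwork", "internal_note", ...
    amount_usd: float = 0.0     # only meaningful for allocation recommendations
    names_client: bool = False  # does the text name a specific client?

def requires_human_review(output: Output) -> bool:
    """High-consequence outputs wait for a human; everything else ships."""
    if output.kind == "client_email":
        return True
    if output.kind == "allocation_rec" and output.amount_usd > 50_000:
        return True
    if output.kind == "account_paperwork":
        return True
    if output.names_client:
        return True
    return False

# Internal notes and meeting summaries ship straight through (the ~80%);
# anything on the list above waits in the review queue (the ~20%).
assert requires_human_review(Output(kind="client_email", names_client=True))
assert not requires_human_review(Output(kind="internal_note"))
```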
The 20% that requires human review is where the value of the humans is. The 80% that ships unreviewed is where the value of the AI is. If you flip those ratios, you've built the wrong thing.
What saved the engagement
The Slack channel. Not the AI.
The system caught nothing. The junior staffer caught it. She caught it because she had context the system did not have (she'd been on the call) and because the Slack channel made it easy for her to flag without escalating.
Every AI system I build now has a "weird thing I noticed" Slack channel for the team. Low-friction reporting. No process. No form. Just type the weird thing.
The channel is half operational and half early warning system. Most reports are minor. The rare major one pays for the channel a thousand times over.
What I would not do
I would not have shipped this system without the junior reviewer in the loop. We did, sort of. The original design had partner review only. We added the junior review tier in week 4 because the partners were too busy to review thoughtfully and the AI's outputs were close enough to professional quality that the partners had started rubber-stamping them.
The junior review tier was the actual quality control. The partner review had become theater.
If you build an outbound-email AI for a regulated practice, build the junior review tier first. The partner is your last line, not your first.
What this isn't
This isn't an "AI is dangerous" story. The system has shipped about 4,200 client emails since this incident with zero name errors and zero relationship-damaging issues. The infrastructure works.
It's a story about how the safety net catches the rare failure, and how to design the safety net so it actually catches.
If your AI system has no safety net and runs only on hope, your week 11 will be worse than ours was.