// ultra-niche builds
by Josh · April 23, 2026 · 5 min read

OpenAI Assistants → Custom Agent: The Migration Notes

Built a prototype on OpenAI Assistants. Outgrew it in 4 months. Migrated to a custom agent loop. Here's what we kept, what we threw out, and what I'd do differently.

We built a customer support agent on the OpenAI Assistants API. It worked. It also locked us into patterns that made the agent worse over time. We migrated to a custom agent loop in late March.

These are the notes.

What we built originally

A support agent that:

- Read a knowledge base
- Looked up customer records via function calling
- Generated responses
- Escalated to humans when confidence was low

Assistants API made this fast to build. The threading was managed. The tool-calling abstractions were clean. The file-search-attached-to-an-assistant pattern handled the knowledge base.

We shipped in 3 weeks. The first month was great.

Where it broke down

Cost visibility. The Assistants API doesn't give you granular per-call cost visibility the way the Chat Completions API does. We knew the total bill but couldn't attribute it to specific customers or usage patterns. When the bill jumped 40% one month, we couldn't diagnose why.

Latency on long threads. As customer conversations got longer, the Assistants API got slower because it was loading the full thread context each turn. We had no control over what the model saw vs what it didn't.

Model lock-in. Switching from GPT-4 to a different model required rebuilding the Assistant. Trying it with Claude was a full rewrite.

Vector store limitations. Assistants' built-in file search worked for small knowledge bases but didn't expose tuning parameters. As our corpus grew, retrieval quality dropped and we couldn't fix it without leaving the abstraction.

Tool-calling reliability quirks. Sometimes the model silently skipped functions we expected it to call, and nothing in the API surfaced why. With no visibility into the model's reasoning, debugging was slow.

What we migrated to

A custom agent loop on the Chat Completions API:

1. Application code owns the message history
2. Each turn: format messages, call Chat Completions with tools, parse response
3. If tool calls: execute, append results to history, call again
4. If terminal response: return to user
5. Custom token-budget management for context windows
6. Custom RAG layer using pgvector + reranker (instead of file_search)
7. Full per-call cost logging to Sentry
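
Here's a minimal sketch of that loop against the official `openai` Node SDK. The model name is an arbitrary choice for the example, and `executeTool` is a hypothetical dispatcher standing in for our real tool implementations:

```typescript
import OpenAI from "openai";
import type {
  ChatCompletionMessageParam,
  ChatCompletionTool,
} from "openai/resources/chat/completions";

const client = new OpenAI();

// Hypothetical dispatcher mapping a tool name to our real implementations;
// returns a JSON-serializable result.
declare function executeTool(name: string, args: unknown): Promise<unknown>;

async function runAgentTurn(
  history: ChatCompletionMessageParam[],
  tools: ChatCompletionTool[],
): Promise<string> {
  // Loop until the model returns a terminal (non-tool-call) response.
  for (let i = 0; i < 10; i++) {
    const res = await client.chat.completions.create({
      model: "gpt-4o", // arbitrary choice for the sketch
      messages: history,
      tools,
    });

    const msg = res.choices[0].message;
    history.push(msg); // application code owns the message history

    if (!msg.tool_calls?.length) {
      return msg.content ?? "";
    }

    // Execute each requested tool, append results, then call the model again.
    for (const call of msg.tool_calls) {
      if (call.type !== "function") continue;
      const result = await executeTool(
        call.function.name,
        JSON.parse(call.function.arguments),
      );
      history.push({
        role: "tool",
        tool_call_id: call.id,
        content: JSON.stringify(result),
      });
    }
  }
  throw new Error("agent loop exceeded max iterations");
}
```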

Total custom-loop code: about 400 lines of TypeScript. We added another 200 for the RAG layer and prompt management.

What we kept

The system prompts. The role definition, scope, refusal rules, output format — all preserved. The model's behavior didn't change.

The tool definitions. Same function signatures. Same JSON schemas. Just called from a different orchestrator.
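
For illustration, a hypothetical customer-lookup tool in the shape both APIs accept (our real schemas are more involved, but structurally identical, which is why they ported over unchanged):

```typescript
import type { ChatCompletionTool } from "openai/resources/chat/completions";

// Hypothetical example; our real schemas are bigger but the same shape.
const lookupCustomer: ChatCompletionTool = {
  type: "function",
  function: {
    name: "lookup_customer",
    description: "Fetch a customer record by email address.",
    parameters: {
      type: "object",
      properties: {
        email: { type: "string", description: "Customer email address" },
      },
      required: ["email"],
    },
  },
};
```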

The escalation logic. Same confidence thresholds, same "escalate to human" branch.
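
In sketch form, with a placeholder threshold and hypothetical helpers rather than our production values:

```typescript
// Placeholder threshold and hypothetical helpers, not our production values.
const ESCALATION_THRESHOLD = 0.7;

declare function handOffToHuman(draft: string): Promise<void>;
declare function sendToCustomer(draft: string): Promise<void>;

async function respondOrEscalate(draft: string, confidence: number): Promise<void> {
  if (confidence < ESCALATION_THRESHOLD) {
    await handOffToHuman(draft); // same branch, new orchestrator
    return;
  }
  await sendToCustomer(draft);
}
```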

What we threw out

Thread management. We now own conversation state in our database. Faster, more visible, more controllable.

File search. Replaced with a real RAG pipeline using pgvector and a cross-encoder reranker. Better retrieval, tunable.
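
A rough sketch of the two-stage retrieval, assuming a `kb_chunks` table with a pgvector `embedding` column; `embed` and `rerank` are stand-ins for whatever embedding model and cross-encoder you run:

```typescript
import { Pool } from "pg";

const pool = new Pool();

// Stand-ins for whatever embedding model and cross-encoder you run.
declare function embed(text: string): Promise<number[]>;
declare function rerank(query: string, docs: string[]): Promise<number[]>;

async function retrieve(query: string, k = 5): Promise<string[]> {
  const queryVec = await embed(query);

  // Stage 1: wide candidate set via pgvector; `<=>` is cosine distance.
  const { rows } = await pool.query<{ content: string }>(
    `SELECT content
       FROM kb_chunks
      ORDER BY embedding <=> $1::vector
      LIMIT 50`,
    [JSON.stringify(queryVec)],
  );

  // Stage 2: cross-encoder rerank, keep the top k.
  const docs = rows.map((r) => r.content);
  const scores = await rerank(query, docs);
  return docs
    .map((doc, i) => ({ doc, score: scores[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((x) => x.doc);
}
```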

Implicit token management. We're now explicit about what context gets included. Older messages get summarized into a rolling summary rather than dropped.
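
Roughly like this, with the token counter stubbed out (we wrap a tokenizer; the names here are hypothetical):

```typescript
import OpenAI from "openai";
import type { ChatCompletionMessageParam } from "openai/resources/chat/completions";

const client = new OpenAI();

// Stub for a real tokenizer wrapper; the name is hypothetical.
declare function countTokens(messages: ChatCompletionMessageParam[]): number;

async function compactHistory(
  history: ChatCompletionMessageParam[],
  budgetTokens: number,
): Promise<ChatCompletionMessageParam[]> {
  if (countTokens(history) <= budgetTokens) return history;

  // Keep the most recent turns verbatim; fold the rest into a summary.
  const recent = history.slice(-10);
  const older = history.slice(0, -10);
  const transcript = older
    .map((m) => `${m.role}: ${typeof m.content === "string" ? m.content : ""}`)
    .join("\n");

  const res = await client.chat.completions.create({
    model: "gpt-4o-mini", // a cheap model is fine for summarization
    messages: [
      {
        role: "system",
        content:
          "Summarize this support conversation. Preserve customer details, commitments made, and open issues.",
      },
      { role: "user", content: transcript },
    ],
  });

  const summary = res.choices[0].message.content ?? "";
  return [
    { role: "system", content: `Conversation so far (summary): ${summary}` },
    ...recent,
  ];
}
```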

The Assistant abstraction itself. No more "assistants" as first-class objects. Just prompts, tools, and a loop.

What got better

Cost dropped 38%. The improved RAG layer + explicit context management cut token usage substantially. We were paying for context we didn't need.

Latency dropped 22%. Less context per call. Faster responses.

Per-call observability. Every call has cost, tokens, retrieved chunks, tool calls — all in our Sentry dashboard. Debugging is fast.
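
The logging wrapper is simple. This sketch attaches the numbers as Sentry breadcrumbs and uses placeholder prices; check current rates for whatever model you're actually on:

```typescript
import * as Sentry from "@sentry/node";
import OpenAI from "openai";
import type { ChatCompletionCreateParamsNonStreaming } from "openai/resources/chat/completions";

const client = new OpenAI();

// Illustrative placeholder prices (USD per 1M tokens); use current rates.
const PRICE_PER_1M = { input: 2.5, output: 10 };

async function loggedCompletion(
  params: ChatCompletionCreateParamsNonStreaming,
  customerId: string,
) {
  const res = await client.chat.completions.create(params);

  if (res.usage) {
    const costUsd =
      (res.usage.prompt_tokens / 1e6) * PRICE_PER_1M.input +
      (res.usage.completion_tokens / 1e6) * PRICE_PER_1M.output;

    // Breadcrumbs ride along on Sentry events, so each call's cost and
    // token counts stay attributable to a customer.
    Sentry.addBreadcrumb({
      category: "llm.call",
      data: {
        customerId,
        model: params.model,
        promptTokens: res.usage.prompt_tokens,
        completionTokens: res.usage.completion_tokens,
        costUsd,
      },
    });
  }

  return res;
}
```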

Model flexibility. We tested with Claude Sonnet 4.6 and saw a 14% quality improvement. We're now migrating production traffic to Claude.

What got worse (temporarily)

Build complexity. We now own more code. The Assistants API was simpler.

Conversation state migration. Customers in mid-conversation had to be migrated. We did it via a one-time script that converted Assistant threads to our message format. Bumpy week.
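
The script itself was short. A sketch of the shape, with `saveMessage` standing in for our database layer:

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// Stand-in for our database layer.
declare function saveMessage(
  conversationId: string,
  role: string,
  content: string,
): Promise<void>;

async function migrateThread(threadId: string, conversationId: string) {
  // The SDK auto-paginates list endpoints under `for await`.
  for await (const message of client.beta.threads.messages.list(threadId, {
    order: "asc",
  })) {
    const text = message.content
      .flatMap((part) => (part.type === "text" ? [part.text.value] : []))
      .join("\n");
    await saveMessage(conversationId, message.role, text);
  }
}
```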

Initial latency went UP on the very first iteration. Our naive context management included too much. We had to iterate.

What I'd do differently

Build the custom agent loop from day one. The Assistants API abstraction looks like a shortcut, but it's a one-way door once you outgrow it.

The Chat Completions API + custom loop pattern is more code to write up front, but it's code that keeps working as you grow. The Assistants abstraction layer doesn't grow with you.

That said: if you're prototyping and the use case might not survive, Assistants API is fine. It got us to working faster than custom would have. We just stayed too long.

When Assistants is still right

- Pure prototyping with no production traffic
- Internal tools for non-engineers (less code, easier handoff)
- One-off automations that won't grow
- Use cases where you genuinely don't need cost visibility or model flexibility

If none of those describes your project, build the custom loop. The discipline pays back.

The lesson

The "easier abstraction" is often the more expensive abstraction over time. Read the cost of the lock-in before you adopt the convenience.

Easier said than done in week one of a project. But you can ask the question: "If I outgrow this in 6 months, what does the migration look like?" If the answer is "rebuild from scratch," give that real weight.

For us, six months in, the answer was rebuild from scratch. We did the rebuild. Next time we'd start with the custom loop.

openai · assistants · agent · migration · long-tail