What I learned shipping 12 agent products in 18 months

Eighteen months ago we ran our first eval harness against a handful of agent prototypes. Today twelve of those prototypes are in production serving real users. Here's what survived the transition and what didn't.

Tool selection beats model selection

We assumed the model would be the bottleneck. It wasn't. The bottleneck was almost always the tool surface: too many tools, ambiguous schemas, or a function that almost, but not quite, did what the agent needed. When we cut tool counts to under ten and tightened the JSON schemas, success rates jumped 30-40 percentage points across every model we tested.
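
To make "tightening a schema" concrete, here is a minimal sketch. The tools, field names, and the small validator are all hypothetical illustrations, not our production definitions, and the validator covers only the handful of JSON Schema keywords used here. The idea is that a tight schema rejects a malformed call up front, while a loose one accepts almost anything.

```python
# Hypothetical tool definitions for illustration only.
LOOSE_TOOL = {
    "name": "search",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}, "options": {"type": "object"}},
    },
}

TIGHT_TOOL = {
    "name": "search_orders",
    "description": "Look up one customer's orders by status.",
    "parameters": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string"},
            "status": {"type": "string", "enum": ["open", "shipped", "refunded"]},
            "limit": {"type": "integer", "minimum": 1, "maximum": 50},
        },
        "required": ["customer_id", "status"],
        "additionalProperties": False,
    },
}

# Only the scalar types used above are checked; other types pass through.
_PY_TYPES = {"string": str, "integer": int}


def validate_args(tool, args):
    """Return a list of problems with a tool call; an empty list means it is valid."""
    schema = tool["parameters"]
    props = schema.get("properties", {})
    errors = []
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required argument: {name}")
    for name, value in args.items():
        spec = props.get(name)
        if spec is None:
            if schema.get("additionalProperties", True) is False:
                errors.append(f"unexpected argument: {name}")
            continue
        expected = _PY_TYPES.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it explicitly.
        if expected is not None and (isinstance(value, bool) or not isinstance(value, expected)):
            errors.append(f"{name}: expected {spec['type']}")
            continue
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: must be one of {spec['enum']}")
        if "minimum" in spec and value < spec["minimum"]:
            errors.append(f"{name}: below minimum {spec['minimum']}")
        if "maximum" in spec and value > spec["maximum"]:
            errors.append(f"{name}: above maximum {spec['maximum']}")
    return errors


if __name__ == "__main__":
    call = {"status": "pending", "limit": 500, "verbose": True}
    for problem in validate_args(TIGHT_TOOL, call):
        print(problem)
```

The same kind of sloppy call passes the loose schema untouched, so the agent only finds out something is wrong after the tool runs. Feeding the error list back to the model is one option for letting it repair the call.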

The eval gap is real

Public benchmarks correlate weakly with how an agent performs on your specific workflow. We ended up writing custom evals for every product:

Replay logs from real user sessions
Synthetic edge cases generated by the team
A small set of golden tasks the team scored manually each week

The manual scoring sounds like overhead. It's the highest-leverage thing we do, because it catches drift that automated metrics miss for weeks.
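
One way to turn weekly golden-task scores into a drift signal is sketched below. This is an assumption-laden illustration rather than our harness: the 0-to-1 score scale, the three-week baseline window, and the 0.1 drop threshold are all made-up values you would tune yourself.

```python
from statistics import mean


def weekly_means(scores_by_week):
    """Average the manual scores (assumed to be 0.0-1.0) for each week."""
    return [mean(week) for week in scores_by_week]


def detect_drift(scores_by_week, window=3, max_drop=0.1):
    """Flag drift when the latest weekly mean falls more than `max_drop`
    below the average of the previous `window` weekly means.

    `window` and `max_drop` are hypothetical defaults, not recommendations.
    """
    means = weekly_means(scores_by_week)
    if len(means) < window + 1:
        return False  # not enough history to form a baseline
    baseline = mean(means[-(window + 1):-1])
    return baseline - means[-1] > max_drop


if __name__ == "__main__":
    # Four weeks of scores for the same four golden tasks (invented data).
    history = [
        [1.0, 1.0, 0.5, 1.0],
        [1.0, 0.5, 1.0, 1.0],
        [1.0, 1.0, 1.0, 0.5],
        [0.5, 0.5, 1.0, 0.5],
    ]
    print("drift detected:", detect_drift(history))
```

A check like this does not replace the human scoring; it only makes a drop in the scores the team already produces harder to overlook.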

Memory is unsolved

Every product that needed long-horizon memory either bolted on a vector store and called it done, or rolled a custom recall layer with hand-tuned heuristics. The second group shipped better products. If you're building anything that requires the agent to remember user preferences across sessions, plan for at least a month of memory work.
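
To give a sense of what "hand-tuned heuristics" can mean, here is a toy recall scorer. It is a sketch under stated assumptions, not the layer any of our products shipped: the `Memory` fields, the keyword-overlap relevance measure, the 30-day half-life, and the preference boost are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    age_days: float
    is_preference: bool = False  # hypothetical flag for stated user preferences


def recall_score(memory, query, half_life_days=30.0, preference_boost=1.0):
    """Score a memory by keyword overlap, weighted by recency and a preference boost.

    All three heuristics and their constants are illustrative assumptions.
    """
    query_terms = set(query.lower().split())
    memory_terms = set(memory.text.lower().split())
    if not query_terms:
        return 0.0
    overlap = len(query_terms & memory_terms) / len(query_terms)
    recency = 0.5 ** (memory.age_days / half_life_days)  # halves every half_life_days
    boost = preference_boost if memory.is_preference else 0.0
    return overlap * (recency + boost)


def recall(memories, query, k=2):
    """Return up to k memories with a non-zero score, best first."""
    scored = [(recall_score(m, query), m) for m in memories]
    relevant = [pair for pair in scored if pair[0] > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in relevant[:k]]


if __name__ == "__main__":
    store = [
        Memory("user prefers dark mode", age_days=90, is_preference=True),
        Memory("user asked about dark mode pricing", age_days=1),
        Memory("shipping address updated", age_days=2),
    ]
    for m in recall(store, "dark mode settings"):
        print(m.text)
```

The point of the boost is that a preference stated three months ago can still outrank a recent but incidental mention, which pure recency or pure similarity ranking tends to get wrong.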

What's next

The frontier is moving toward agents that compose other agents. We're watching this carefully but not betting on it yet — the orchestration overhead and debugging story are still rough. Stay tuned for a Q4 post on what we learned from our first multi-agent product.
