Best practices for evaluating agent loops without burning $1k a run?

We're running multi-step agents with tools and the eval bill is getting absurd. Sampling strategies, cached fixtures, anything that kept your eval budget sane?

Solved

swyx@shawn · 54 d ago · Accepted answer

We run a fixed set of 50 cached trajectories nightly + a 10-prompt smoke set on every PR. Full eval only on release candidates.
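For anyone wanting to wire up the tiering, here is a rough sketch of how a CI job might pick cases per trigger. The function name `select_cases`, the trigger strings, and the choice to make the smoke set a subset of the nightly set are all assumptions for illustration, not a description of the setup above. Only the sizes (10 and 50) come from this answer.

```python
import random

# Tier sizes from the answer above; everything else here is hypothetical.
SMOKE_SIZE = 10
NIGHTLY_SIZE = 50


def select_cases(all_cases, trigger, seed=0):
    """Return the eval case IDs to run for a given CI trigger.

    trigger is one of "pr", "nightly", or "release" (assumed names).
    """
    ordered = sorted(all_cases)
    if trigger == "release":
        # Release candidates get the full eval.
        return ordered

    # A fixed seed keeps the sampled subset identical from run to run,
    # so cached trajectories stay valid and scores stay comparable.
    shuffled = ordered[:]
    random.Random(seed).shuffle(shuffled)

    if trigger == "nightly":
        return sorted(shuffled[:NIGHTLY_SIZE])
    if trigger == "pr":
        # Taken from the same shuffle, so the smoke set is a subset of nightly.
        return sorted(shuffled[:SMOKE_SIZE])
    raise ValueError(f"unknown trigger: {trigger!r}")
```

Pinning the seed matters more than the sampling method: if the subset changes between runs, the cache misses and the cost savings disappear.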

Evan You@evan · 54 d ago

LangSmith's deterministic replay cut our spend ~60%. Worth the lock-in for us.
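If you want the same idea without the vendor, a minimal record/replay cache for tool calls could look roughly like this. To be clear, this is not LangSmith's API. `ReplayCache` and its methods are hypothetical names, and it assumes tool arguments and results are JSON-serializable and that tools are safe to replay (no side effects you depend on).

```python
import hashlib
import json


class ReplayCache:
    """Record tool results on first call, replay them on later calls."""

    def __init__(self, store=None):
        # store maps request hash -> recorded result; a dict here,
        # but it could be loaded from a fixture file checked into the repo.
        self.store = {} if store is None else store
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(name, args):
        # sort_keys makes the hash independent of argument order.
        payload = json.dumps({"name": name, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, name, args, live_fn):
        k = self.key(name, args)
        if k in self.store:
            self.hits += 1
            return self.store[k]  # replayed, no live call made
        self.misses += 1
        result = live_fn(**args)  # the only place a live call happens
        self.store[k] = result
        return result
```

The same wrapper works for model calls if you hash the full prompt and sampling parameters, though any prompt change will invalidate the recorded entries.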

