r/ai
@sama

Best practices for evaluating agent loops without burning $1k a run?

We're running multi-step agents with tools and the eval bill is getting absurd. Sampling strategies, cached fixtures, anything that kept your eval budget sane?

Solved
swyx @shawn · 54 d ago · Accepted answer

We run a fixed set of 50 cached trajectories nightly + a 10-prompt smoke set on every PR. Full eval only on release candidates.
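
A rough sketch of how that tiering could be wired into CI, in case it helps. The trigger names, `EvalPlan`, and `select_eval_plan` are made up for illustration; only the sizes (50 cached trajectories, 10 smoke prompts, full eval on release candidates) come from the setup described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalPlan:
    suite: str                # which eval suite to run
    n_cases: Optional[int]    # None means "run everything"


def select_eval_plan(trigger: str) -> EvalPlan:
    """Map a CI trigger (hypothetical names) to an eval tier."""
    if trigger == "pull_request":
        # Cheap smoke set on every PR.
        return EvalPlan(suite="smoke", n_cases=10)
    if trigger == "nightly":
        # Fixed set of cached trajectories, run once a day.
        return EvalPlan(suite="cached_trajectories", n_cases=50)
    if trigger == "release_candidate":
        # The expensive full eval, only when cutting a release.
        return EvalPlan(suite="full", n_cases=None)
    raise ValueError(f"unknown trigger: {trigger!r}")


if __name__ == "__main__":
    for trigger in ("pull_request", "nightly", "release_candidate"):
        print(trigger, select_eval_plan(trigger))
```

The point of keeping the mapping in one function is that the expensive tier can only run when a release candidate explicitly asks for it.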

Evan You @evan · 54 d ago

LangSmith's deterministic replay cut our spend ~60%. Worth the lock-in for us.
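
For anyone who wants the general idea without the lock-in, here is a minimal record/replay sketch. To be clear, this is not LangSmith's API; `ReplayCache` and `fake_search` are hypothetical names, and it assumes tool arguments are JSON-serializable and that a tool's output depends only on its name and arguments.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Optional


class ReplayCache:
    """Record a tool call's result once, then replay it on identical calls."""

    def __init__(self, store: Optional[Dict[str, Any]] = None) -> None:
        self.store: Dict[str, Any] = {} if store is None else store
        self.live_calls = 0  # how many calls actually hit the real tool

    @staticmethod
    def _key(tool_name: str, args: Dict[str, Any]) -> str:
        # Sort keys so the same arguments always hash to the same key.
        payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, tool_name: str, tool_fn: Callable[..., Any], **args: Any) -> Any:
        key = self._key(tool_name, args)
        if key not in self.store:
            # Cache miss: pay for the real call once and record the result.
            self.store[key] = tool_fn(**args)
            self.live_calls += 1
        return self.store[key]


def fake_search(query: str) -> str:
    # Stand-in for an expensive tool or model call.
    return f"results for {query}"


if __name__ == "__main__":
    cache = ReplayCache()
    cache.call("search", fake_search, query="agent evals")
    cache.call("search", fake_search, query="agent evals")  # replayed
    print(cache.live_calls)
```

Persisting `store` to disk (for example as a JSON file checked in next to the eval set) would let repeated runs reuse the same recorded results.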
