We are spinning up eval harnesses across code, agents, and tool-use. If you run benchmarks in prod, please share which tasks break today so we can target them. Reasoning + long-horizon planning are the priority.
Would love better structured-output on edge cases. Today retries mask bugs.
Long-horizon planning is the one I'd pay for. Give us day-long agent tasks.
Speed matters more than most people admit. Faster wrong > slow right, then iterate.
Hey this is crazy