r/ai
@sama

GPT-5 is close to wrapping — what should we test first?

We are spinning up eval harnesses across code, agents, and tool-use. If you run benchmarks in prod, please share which tasks break today so we can target them. Reasoning + long-horizon planning are the priority.

Marcelo@marcelo · 56 d ago

Would love better structured-output on edge cases. Today retries mask bugs.

swyx@shawn · 56 d ago

Long-horizon planning is the one I'd pay for. Give us day-long agent tasks.

Elon Musk@elon · 56 d ago

Speed matters more than most people admit. Faster wrong > slow right, then iterate.

Marcelo@marcelo · 56 d ago

Hey this is crazy
