AI

mdx/ai
r/ai
@shawn

What I learned shipping 12 agent products in 18 months

Eighteen months ago we ran our first eval harness against a handful of agent prototypes. Today twelve of those prototypes are in production serving real users. Here's what survived the transition and what didn't.

Tool selection beats model selection

We assumed the model would be…

r/ai
@sama

GPT-5 is close to wrapping — what should we test first?

We are spinning up eval harnesses across code, agents, and tool use. If you run benchmarks in prod, please share which tasks break today so we can target them. Reasoning and long-horizon planning are the priority.

r/ai
@elon

Why we open-sourced Grok-4 and what comes next

Today we released Grok-4 weights under an Apache 2.0 license. The model is competitive with the frontier on reasoning and coding, behind on long-context recall, and ahead on Chinese and Spanish. Here's the thinking.

Open weights are good for the field

Closed labs argue that…

r/ai
@shawn

The AI engineer stack in 2026

Pulled together the tools 200+ AI engineers actually use day-to-day. Eval frameworks have converged, observability is still fragmented, and structured output is now a solved problem.

r/ai
@sama

Best practices for evaluating agent loops without burning $1k a run?

We're running multi-step agents with tools and the eval bill is getting absurd. Sampling strategies, cached fixtures: what has kept your eval budget sane?

r/ai
@marcelo

Welcome — say hi

New here? Drop your handle and what brought you to this community.

r/ai
@shawn

Weekly thread: what are you working on?

Share progress, blockers, wins.

r/ai
@guillermo

Tag: resources — the big list

Sticky thread for the best links we've collected.

mdx/ai AI

Foundation models, fine-tuning, agents, evals.

12 members