Eval-driven agents — a working harness, in public
We open-sourced the eval pattern we run on every agent we ship: golden sets, regression suites, traceable replays. Notes on what holds up after launch and what doesn't.
Coming soon
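As a rough illustration of what a golden set plus a regression check can look like, here is a minimal sketch. It is not the open-sourced harness itself: the names `GoldenCase`, `run_golden_set`, and `regressions` are hypothetical, the agents are toy stand-ins, and exact-match scoring is an assumption made to keep the example small. It does not cover traceable replays.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    """One curated input paired with the answer we expect (hypothetical shape)."""
    case_id: str
    prompt: str
    expected: str


def run_golden_set(agent: Callable[[str], str], cases: list[GoldenCase]) -> dict[str, bool]:
    """Run every case; a case passes on exact match after trimming whitespace."""
    return {c.case_id: agent(c.prompt).strip() == c.expected for c in cases}


def regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return the cases that passed in the baseline run but fail now."""
    return sorted(cid for cid, ok in baseline.items() if ok and not current.get(cid, False))


# Toy stand-in agents: the "new" version breaks one previously passing case.
def old_agent(prompt: str) -> str:
    return prompt.upper()


def new_agent(prompt: str) -> str:
    return prompt.upper() if len(prompt) < 5 else prompt


cases = [GoldenCase("short", "hi", "HI"), GoldenCase("long", "hello", "HELLO")]
baseline = run_golden_set(old_agent, cases)
current = run_golden_set(new_agent, cases)
print(regressions(baseline, current))
```

Comparing against a stored baseline, rather than only checking the current pass rate, is what turns a golden set into a regression suite: a change that fixes one case while breaking another still gets flagged.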
We publish what we learn about evals, agent design, and the unglamorous corners of production AI. The foundation labs train the models; we work the surfaces around them — and write down what holds up after launch.
The decisions that matter live downstream of the model. We research the surfaces — orchestration, evaluation, failure modes — and publish what survives contact with real users.
Most agent demos work on Tuesday and break by Friday. The difference is rarely the model; it's eval coverage, retry discipline, and a sane failure surface. These are the patterns we keep reaching for — and the ones we keep abandoning.
Routing between Anthropic, OpenAI, and the open-weight labs based on task shape — not vibes. Latency, cost, and a sane fallback ladder when the upstream goes down.
Coming soon
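One way to picture a fallback ladder, as a minimal sketch: try providers in a fixed order and move down a rung whenever a call fails. The provider names and callables below are placeholders, not real client libraries, and `call_with_fallback` and `AllProvidersFailed` are hypothetical names chosen for this example. Task-shape routing, latency, and cost are left out.

```python
from typing import Callable, Sequence


class AllProvidersFailed(RuntimeError):
    """Raised when every rung of the ladder has errored."""


def call_with_fallback(
    prompt: str,
    ladder: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return the first success as (name, reply)."""
    errors = []
    for name, call in ladder:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any upstream failure moves us down a rung
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors) or "empty ladder")


# Placeholder providers: the first simulates an outage, the second answers.
def primary(prompt: str) -> str:
    raise TimeoutError("upstream unavailable")


def secondary(prompt: str) -> str:
    return f"echo: {prompt}"


ladder = [("primary", primary), ("secondary", secondary)]
print(call_with_fallback("ping", ladder))
```

Returning the provider name alongside the reply keeps the fallback visible to callers and logs, so a silent downgrade does not go unnoticed.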
We default to prompt engineering + RAG + good evals. Here's the small set of cases where we reached for fine-tuning anyway, and what it cost.
Coming soon
Posted irregularly. No marketing. Just notes from production.