Research

Open notes from the frontier.

We publish what we learn about evals, agent design, and the unglamorous corners of production AI. The foundation labs train the models; we work the surfaces around them — and write down what holds up after launch.

What we research

The decisions that matter live downstream of the model. We research the surfaces — orchestration, evaluation, failure modes — and publish what survives contact with real users.

Most agent demos work on Tuesday and break by Friday. The difference is rarely the model; it's eval coverage, retry discipline, and a sane failure surface. These are the patterns we keep reaching for — and the ones we keep abandoning.

Recent

What we've published.

2026-04-22 · Evals · Agents
Eval-driven agents — a working harness, in public
We open-sourced the eval pattern we run on every agent we ship: golden sets, regression suites, traceable replays. Notes on what holds up after launch and what doesn't.
Coming soon
2026-03-08 · Orchestration · Production
Multi-model orchestration without religion
Routing between Anthropic, OpenAI, and the open-weight labs based on task shape — not vibes. Latency, cost, and a sane fallback ladder when the upstream goes down.
Coming soon
2026-01-30 · Fine-tuning · RAG
When fine-tuning earns its keep (and when it doesn't)
We default to prompt engineering, RAG, and good evals. Here's the small set of cases where we reached for fine-tuning anyway, and what it cost.
Coming soon

Topics we track

01 Eval design
02 Agent architecture
03 Multi-model orchestration
04 Production observability
05 Failure-mode playbooks
06 Adversarial testing

Stay in the loop

Get research drops in your inbox.

Posted irregularly. No marketing. Just notes from production.

Email us to subscribe · Read the journal →
