Conjuring
Research

Open notes from the frontier.

We publish what we learn about evals, agent design, and the unglamorous corners of production AI. The foundation labs train the models; we work the surfaces around them — and write down what holds up after launch.

What we research

The decisions that matter live downstream of the model. We research the surfaces — orchestration, evaluation, failure modes — and publish what survives contact with real users.

Most agent demos work on Tuesday and break by Friday. The difference is rarely the model; it's eval coverage, retry discipline, and a sane failure surface. These are the patterns we keep reaching for — and the ones we keep abandoning.

Recent

What we've published.

  • 2026-04-22 · Evals · Agents

    Eval-driven agents — a working harness, in public

    We open-sourced the eval pattern we run on every agent we ship: golden sets, regression suites, traceable replays. Notes on what holds up after launch and what doesn't.

    Coming soon

  • 2026-03-08 · Orchestration · Production

    Multi-model orchestration without religion

    Routing between Anthropic, OpenAI, and the open-weight labs based on task shape — not vibes. Latency, cost, and a sane fallback ladder when the upstream goes down.

    Coming soon

  • 2026-01-30 · Fine-tuning · RAG

    When fine-tuning earns its keep (and when it doesn't)

    We default to prompt-engineering + RAG + good evals. Here's the small set of cases where we reached for fine-tuning anyway, and what it cost.

    Coming soon

Topics we track

  • 01 Eval design
  • 02 Agent architecture
  • 03 Multi-model orchestration
  • 04 Production observability
  • 05 Failure-mode playbooks
  • 06 Adversarial testing

Stay in the loop

Get research drops in your inbox.

Posted irregularly. No marketing. Just notes from production.