Research

Notes on tool-use evals at small budgets

Amy Team

Jan 13, 2026 · 1 min read

The "build a giant eval set" advice is great when you have a research team. We don't, so we did the small version: twenty hand-curated cases, each one a regression we'd already shipped at least once.
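For illustration only, here is a minimal Python sketch of what a suite like this could look like. It is not our actual harness. The names `ToolCall`, `RegressionCase`, and `failing_cases`, the two example cases, and the stub models are all hypothetical. It assumes a model can be wrapped as a function that takes a prompt and returns the single tool call it chose.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolCall:
    """The tool a model chose and the arguments it passed (hypothetical shape)."""
    name: str
    args: dict = field(default_factory=dict)


@dataclass
class RegressionCase:
    """One hand-curated case, named after the regression it guards against."""
    case_id: str
    prompt: str
    expected: ToolCall


# Assumption: a model is any callable mapping a prompt to one ToolCall.
Model = Callable[[str], ToolCall]

# Two invented cases standing in for the real twenty.
CASES = [
    RegressionCase(
        "weather-city-arg",
        "What's the weather in Paris?",
        ToolCall("get_weather", {"city": "Paris"}),
    ),
    RegressionCase(
        "no-tool-for-smalltalk",
        "Thanks, that's all!",
        ToolCall("none"),
    ),
]


def failing_cases(model: Model, cases: list[RegressionCase]) -> list[str]:
    """Return the ids of cases where the model's tool call differs from the expected one."""
    failures = []
    for case in cases:
        try:
            actual = model(case.prompt)
        except Exception:
            # A crash counts as a failure, not as a skipped case.
            failures.append(case.case_id)
            continue
        if actual != case.expected:
            failures.append(case.case_id)
    return failures


def old_model(prompt: str) -> ToolCall:
    """Stub for the current model: only calls the tool for weather questions."""
    if "weather" in prompt.lower():
        return ToolCall("get_weather", {"city": "Paris"})
    return ToolCall("none")


def new_model(prompt: str) -> ToolCall:
    """Stub for a candidate model that calls the tool on every prompt."""
    return ToolCall("get_weather", {"city": "Paris"})


if __name__ == "__main__":
    print("old:", failing_cases(old_model, CASES))
    print("new:", failing_cases(new_model, CASES))
```

In this sketch a model swap would go ahead only when `failing_cases` returns an empty list. Exact-match comparison on the tool call is the simplest possible check; a real suite may need looser matching on arguments.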

The twenty cases catch about 80% of model-swap regressions before they reach users. The remaining 20% mostly surface as user reports within a day, which is fast enough to roll back. The lesson, again: a small, curated set beats a large, procedurally generated one when the cost of being wrong is bounded.

