Research

Notes on tool-use evals at small budgets

Amy Team
Jan 13, 2026

The "build a giant eval set" advice is great when you have a research team. We don't, so we did the small version: twenty hand-curated cases, each one a regression we'd already shipped at least once.
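As a rough illustration, a case set like this can live in a single file. The sketch below is not our actual harness: `ToolCall`, `RegressionCase`, `run_suite`, the tool names, and the prompts are all hypothetical, and it assumes the model can be wrapped as a function that takes a prompt and returns one tool call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolCall:
    """One tool invocation: the tool name plus its arguments (hypothetical shape)."""
    name: str
    args: dict


@dataclass
class RegressionCase:
    """A hand-curated case tied to a regression that already shipped once."""
    case_id: str
    prompt: str
    expected: ToolCall


# Assumption: the model under test is wrapped as prompt -> ToolCall.
Model = Callable[[str], ToolCall]


def run_suite(model: Model, cases: list[RegressionCase]) -> list[str]:
    """Return the ids of cases whose tool call differs from the expected one."""
    failures = []
    for case in cases:
        try:
            got = model(case.prompt)
        except Exception:
            # A crash in the wrapper counts as a failed case, not a failed run.
            failures.append(case.case_id)
            continue
        if got != case.expected:
            failures.append(case.case_id)
    return failures


# Two illustrative cases; a real set would hold all twenty.
CASES = [
    RegressionCase(
        "weather-units",
        "What's the weather in Oslo in celsius?",
        ToolCall("get_weather", {"city": "Oslo", "units": "celsius"}),
    ),
    RegressionCase(
        "calendar-timezone",
        "Book a call at 3pm Berlin time tomorrow.",
        ToolCall("create_event", {"time": "15:00", "tz": "Europe/Berlin"}),
    ),
]


def stub_model(prompt: str) -> ToolCall:
    """Stand-in for a real model: always calls the weather tool."""
    return ToolCall("get_weather", {"city": "Oslo", "units": "celsius"})


if __name__ == "__main__":
    failed = run_suite(stub_model, CASES)
    print(f"{len(CASES) - len(failed)}/{len(CASES)} passed; failed: {failed}")
```

Exact equality on the tool call is the strictest possible check. It suits cases where the original regression was a wrong tool or a wrong argument; looser matching would need a per-case comparison function.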

The twenty cases catch about 80% of model-swap regressions before they reach users. The remaining 20% mostly surface as user reports within a day, which is fast enough to roll back. The lesson, again: small and curated beats large and procedural when the cost of being wrong is bounded.
