Observability for long-running agent runs

OpenTelemetry assumes spans end. Agent runs disagree.

A single Amy job can spawn 200 tool calls across an eight-hour window, with retries, partial failures, and human approvals in the middle. Treating that as one trace blows up the exporter; treating each tool call as its own trace loses the parent context.

What works: a trace per logical phase, a stable correlation ID stitched through metadata, and ledger-style append-only events for anything we'd want to replay. The traces stop being the source of truth and start being the search index over the events.

Observability for long-running agent runs

Keep reading

Inside Amy's credit system

How we sandbox untrusted browser tools

Designing the Amy job scheduler