Why Agents Fail
Agents fail constantly in production. Understanding the failure categories lets you build targeted recovery strategies instead of generic catch-all handlers:
- Transient failures: API rate limits, network timeouts, temporary service outages. These resolve on their own if you wait and retry.
- Context overflow: The agent's context window fills up mid-task. It loses track of instructions and produces garbage output. No amount of retrying fixes this — you need to reset with a smaller scope.
- Logic failures: The agent misunderstands the task, hallucinates a dependency, or enters an infinite loop. These require re-prompting with clearer instructions, not retrying the same prompt.
- Resource failures: Disk full, memory exhausted, file permissions denied. These are environmental and need human intervention or automated provisioning.
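One way to make these categories actionable is a small classifier that maps a raised exception to a recovery strategy. A minimal sketch in Python, where the exception heuristics and category names are illustrative assumptions, not a real framework API:

```python
from enum import Enum

class FailureKind(Enum):
    TRANSIENT = "transient"        # retry with backoff
    CONTEXT = "context_overflow"   # reset with a smaller scope
    LOGIC = "logic"                # re-prompt with clearer instructions
    RESOURCE = "resource"          # escalate to a human or provisioning

def classify(exc: Exception) -> FailureKind:
    """Map an exception to a failure category (illustrative heuristics)."""
    msg = str(exc).lower()
    if isinstance(exc, (TimeoutError, ConnectionError)) or "rate limit" in msg:
        return FailureKind.TRANSIENT
    if "context length" in msg or "token limit" in msg:
        return FailureKind.CONTEXT
    if isinstance(exc, (OSError, MemoryError)):
        return FailureKind.RESOURCE
    # Default: assume the prompt or plan needs rework, not a retry
    return FailureKind.LOGIC
```

The order of checks matters: `ConnectionError` is a subclass of `OSError`, so network errors must be classified as transient before the resource check runs.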
Retry Logic and Backoff
Retries are the first line of defense against transient failures. The rules are simple but critical to get right:
- Exponential backoff: Wait 1 second, then 2, then 4, then 8. Never retry immediately — you will hammer an already-struggling service and make things worse.
- Jitter: Add random delay to each retry interval. Without jitter, 10 agents all retry at exactly the same time, causing a thundering herd that triggers another outage.
- Max attempts: Cap retries at 3-5 attempts. If a task fails 5 times, it is not a transient issue. Escalate to a different strategy or a human.
- Idempotency: Ensure retried tasks produce the same result if run multiple times. If an agent writes to a file, retrying should overwrite, not append duplicate content.
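The four rules above can be sketched as a small retry wrapper. This is a minimal illustration, assuming "full jitter" (a random delay between zero and the exponential cap); the injectable `sleep` parameter is a testing convenience, not part of any real API:

```python
import random
import time

def retry_with_backoff(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run task(); on failure wait up to base_delay * 2**attempt, then retry.

    `sleep` is injectable so tests can run without real delays.
    """
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: escalate instead of retrying forever
            # Full jitter: random delay in [0, base_delay * 2**attempt]
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

Raising after the final attempt, rather than returning a sentinel, keeps the escalation decision in the caller's hands.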
Circuit Breakers
A circuit breaker stops calling a failing service after repeated failures. It has three states:
- Closed (normal): Requests flow through. Failures are counted. When failures exceed a threshold (e.g., 5 in 60 seconds), the circuit opens.
- Open (blocked): All requests are rejected immediately without attempting execution. This protects the failing service and saves tokens. After a cooldown period, the circuit enters half-open.
- Half-open (probing): One test request is allowed through. If it succeeds, the circuit closes. If it fails, the circuit reopens with a longer cooldown.
Apply circuit breakers at the agent level. If your code reviewer agent fails 3 times in a row, stop sending it reviews. Route to a backup reviewer or queue tasks until the primary recovers.
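The three states can be sketched as a small class. For brevity this sketch counts consecutive failures rather than the time-windowed count ("5 in 60 seconds") described above, and doubles the cooldown when a half-open probe fails; the injectable `clock` is a testing convenience:

```python
import time

class CircuitBreaker:
    """Minimal three-state circuit breaker (closed / open / half-open)."""

    def __init__(self, failure_threshold=5, cooldown=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock            # injectable for testing
        self.failures = 0
        self.state = "closed"
        self.opened_at = None

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown:
                self.state = "half_open"   # allow one probe request through
            else:
                raise RuntimeError("circuit open: request rejected")
        try:
            result = fn()
        except Exception:
            if self.state == "half_open":
                self.cooldown *= 2         # probe failed: reopen, wait longer
                self._open()
            else:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self._open()
            raise
        self.state = "closed"              # any success closes the circuit
        self.failures = 0
        return result

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()
```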
Health Checks
Do not wait for failures to discover broken agents. Proactive health checks catch problems before they affect real tasks:
- Heartbeat: Every 60 seconds, each agent reports its status. If an agent misses two heartbeats, mark it unhealthy and stop routing tasks to it.
- Smoke test: Before assigning a critical task, send the agent a trivial test task. If it responds correctly, proceed. If not, failover immediately.
- Output validation: After every task, validate the agent's output against expected constraints. Did the code agent produce valid TypeScript? Did the reviewer return structured feedback? Invalid output signals degradation.
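Output validation is the easiest of the three to start with. A minimal sketch for a reviewer agent, where the expected shape (a dict with a `verdict` and a list of `comments`) is an assumption for illustration rather than a standard schema:

```python
def validate_review_output(output):
    """Return True if a reviewer agent produced structured feedback.

    Expected shape (an assumption for this sketch): a dict with a
    'verdict' of 'approve' or 'request_changes' and a list of 'comments'.
    """
    if not isinstance(output, dict):
        return False  # free-text replies signal degradation
    if output.get("verdict") not in ("approve", "request_changes"):
        return False
    return isinstance(output.get("comments"), list)
```

A False result here should count as a failure for the circuit breaker, just like an exception would.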
Automatic Failover
Failover is the last resort — when retries are exhausted and the circuit breaker is open. A robust failover strategy has three layers:
- Agent substitution: Route the task to a backup agent with the same capabilities. Maintain at least two agents for every critical role.
- Graceful degradation: If no backup is available, reduce scope. Instead of a full code review, run an automated linter. Partial results beat no results.
- Dead letter queue: Tasks that fail all recovery paths go into a queue for human review. Never silently drop a task — always leave an audit trail.
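The three layers compose naturally into one wrapper that tries each path in order and dead-letters the task only when all of them fail. A sketch, where the handler signatures and the list-backed queue are illustrative assumptions:

```python
import time

def run_with_failover(task, primary, backup=None, degraded=None, dead_letters=None):
    """Try primary, then a backup agent, then a degraded path; else dead-letter.

    `primary`, `backup`, and `degraded` are callables taking the task;
    `dead_letters` is a list standing in for the dead letter queue.
    """
    last_error = None
    for handler in (primary, backup, degraded):
        if handler is None:
            continue  # e.g. no backup agent provisioned for this role
        try:
            return handler(task)
        except Exception as exc:
            last_error = repr(exc)  # keep context for the audit trail
    # All recovery paths exhausted: record the failure, never silently drop it
    if dead_letters is not None:
        dead_letters.append(
            {"task": task, "error": last_error, "failed_at": time.time()}
        )
    return None
```

In the code-review example above, `degraded` would be the automated linter: a partial result that still moves the pipeline forward.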
Practical Exercise
Build an error recovery wrapper for a 3-agent pipeline (coder, reviewer, tester):
- Add retry logic with exponential backoff (max 3 attempts) to each agent
- Implement a circuit breaker that opens after 2 consecutive failures
- Add a health check that validates output format before passing to the next agent
- Create a failover path: if the reviewer fails, run an automated lint check instead
- Add a dead letter queue that logs unrecoverable failures with full context
Test by simulating failures: force the reviewer to return malformed output, then verify the circuit breaker triggers and the failover path activates correctly.