
Monitoring and Observability

Why Observe Your Agents

Agents are black boxes by default. They receive a prompt, produce an output, and you hope the middle part went well. In production, hope is not a strategy. Observability answers three questions:

  • What happened? A complete record of every action an agent took, every tool it called, and every decision it made. Without this, debugging a failure requires reproducing it from scratch.
  • Why did it happen? Context around decisions — what data the agent saw, which branch of logic it followed, why it chose one approach over another. Logs tell you what; traces tell you why.
  • Is it getting worse? Trend data over time — success rates, execution times, token usage, error frequency. A slow degradation is invisible without metrics. By the time you notice quality dropping, it has been bad for weeks.

Structured Logging

Unstructured logs (plain text strings) are useless at scale. You cannot filter, aggregate, or alert on free-form text. Structured logging uses key-value pairs that machines can parse:

  • Required fields: Timestamp (ISO 8601), agent name, task ID, action type, duration in milliseconds, and status (success/failure/skipped). Every log entry must include all six.
  • Context fields: Input hash (not the full input — that is too large), output summary, token count, model used, and retry attempt number. These enable deep debugging without storing entire prompts.
  • Format: JSON lines — one JSON object per line. Every log aggregation tool (Loki, CloudWatch, Datadog) can ingest JSON lines natively. Never use custom delimiters or multi-line log formats.
  • Correlation IDs: Assign a unique ID to each workflow run. Every agent in the workflow includes this ID in its logs. When investigating a failure, filter by correlation ID to see the complete execution path across all agents.
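The four points above can be condensed into one small helper. A minimal Python sketch — the exact field names and the `log_action` signature are illustrative, not a prescribed schema:

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone

def new_correlation_id() -> str:
    """One ID per workflow run, shared by every agent in the pipeline."""
    return uuid.uuid4().hex

def log_action(agent: str, task_id: str, action: str, status: str,
               duration_ms: float, correlation_id: str,
               input_text: str = "", **context) -> str:
    """Emit one JSON line containing the six required fields plus context."""
    entry = {
        # Six required fields
        "ts": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "task_id": task_id,
        "action": action,
        "duration_ms": round(duration_ms, 2),
        "status": status,  # success | failure | skipped
        # Context fields: hash of the input, never the full input
        "correlation_id": correlation_id,
        "input_hash": hashlib.sha256(input_text.encode()).hexdigest()[:16],
        **context,  # e.g. model, tokens, retry attempt
    }
    line = json.dumps(entry, separators=(",", ":"))
    print(line)  # in production: write to your log stream instead
    return line

cid = new_correlation_id()
line = log_action("summarizer", "task-42", "llm_call", "success",
                  duration_ms=812.5, correlation_id=cid,
                  input_text="Summarize the Q3 report",
                  model="gpt-4o", tokens=1480)
```

Because every entry is a single JSON object per line, filtering by `correlation_id` in Loki, CloudWatch, or Datadog reconstructs the full execution path with one query.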

Audit Trails

An audit trail is an immutable record of every significant action. Unlike logs (which are for debugging), audit trails are for accountability and compliance:

  • What to record: File creates, file edits, file deletes, git commits, API calls to external services, and task assignments. If an agent changed something in the real world, it goes in the audit trail.
  • Immutability: Audit entries must be append-only. No agent should be able to edit or delete its own audit records. Write to a separate database or log stream that agents have write-only access to.
  • Retention: Keep audit trails for at least 90 days. For regulated environments, longer. Storage is cheap — the cost of losing an audit trail during an incident investigation is not.
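An append-only trail can be sketched in a few lines. Here a local file opened in append mode stands in for the separate write-only database or log stream described above; the class and field names are assumptions for illustration:

```python
import hashlib
import json
from datetime import datetime, timezone

class AuditTrail:
    """Append-only audit log: entries are added, never edited or deleted.
    The recording code has no method to rewrite or remove past entries."""

    def __init__(self, path: str):
        self.path = path

    def record(self, agent: str, action: str, target: str,
               before: bytes = b"", after: bytes = b"") -> dict:
        entry = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "agent": agent,
            "action": action,   # file_edit, file_delete, git_commit, api_call, ...
            "target": target,
            "before_hash": hashlib.sha256(before).hexdigest(),
            "after_hash": hashlib.sha256(after).hexdigest(),
        }
        # Opened in append mode only -- existing entries are never touched.
        with open(self.path, "a") as f:
            f.write(json.dumps(entry) + "\n")
        return entry

    def entries(self) -> list[dict]:
        with open(self.path) as f:
            return [json.loads(line) for line in f]

trail = AuditTrail("audit.log")
trail.record("coder", "file_edit", "src/app.py",
             before=b"x = 1\n", after=b"x = 2\n")
```

Storing before/after hashes rather than file contents keeps entries small while still proving exactly what changed.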

Performance Metrics

Track four core metrics for every agent. These form your performance baseline and make anomalies immediately visible:

  • Task duration (p50, p95, p99): Median tells you typical performance. The 95th percentile reveals slow outliers. If p95 is 10x the median, you have intermittent issues worth investigating.
  • Success rate: Percentage of tasks that complete without errors. Anything below 95% needs immediate attention. Below 90% means the agent is unreliable and should be taken offline for debugging.
  • Token consumption per task: Track input and output tokens separately. A sudden spike in token usage often means the agent is receiving bloated context or generating verbose, unfocused output.
  • Queue depth: How many tasks are waiting for each agent. A growing queue means the agent cannot keep up with demand — you need to scale up or redistribute work.
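The first three metrics can be rolled up per agent with a small tracker. A sketch using nearest-rank percentiles (adequate for dashboards); the class is an assumption, and queue depth is omitted because it comes from the task queue itself, not from completed-task records:

```python
import math

class AgentMetrics:
    """Per-agent rollup: durations, success rate, and token consumption."""

    def __init__(self):
        self.durations: list[float] = []  # ms
        self.successes = 0
        self.failures = 0
        self.tokens_in = 0
        self.tokens_out = 0

    def record(self, duration_ms: float, ok: bool,
               tokens_in: int = 0, tokens_out: int = 0) -> None:
        self.durations.append(duration_ms)
        if ok:
            self.successes += 1
        else:
            self.failures += 1
        self.tokens_in += tokens_in    # tracked separately, as above
        self.tokens_out += tokens_out

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile over recorded durations."""
        ordered = sorted(self.durations)
        idx = max(0, math.ceil(p / 100 * len(ordered)) - 1)
        return ordered[idx]

    def success_rate(self) -> float:
        total = self.successes + self.failures
        return self.successes / total if total else 1.0

m = AgentMetrics()
for d in [120, 125, 130, 135, 2400]:
    m.record(d, ok=True, tokens_in=100, tokens_out=40)
m.record(500, ok=False)
# p95 is dominated by the 2400 ms outlier while the median stays near 130 ms --
# exactly the "p95 is 10x the median" signal described above.
```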

Session Analytics

Session-level analytics aggregate individual task metrics into meaningful business insights:

  • Daily summary: Total tasks processed, total tokens used, overall success rate, total cost. This is your daily health check — one glance tells you if the system is healthy.
  • Agent comparison: Side-by-side performance of all agents. Which agent has the highest error rate? Which consumes the most tokens per task? This identifies your weakest link.
  • Trend analysis: Plot metrics weekly. Are execution times trending up? Is the success rate declining? Trends predict future problems before they become outages.
  • Cost attribution: Break down API costs by agent, task type, and workflow. Know exactly where your token budget is going so you can optimize the expensive parts first.
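Given JSON-line logs, the daily summary and agent comparison fall out of one aggregation pass. A sketch — the field names match the logging section, but the flat per-1k-token price is a placeholder, not a real rate:

```python
import json
from collections import defaultdict

def daily_summary(log_lines: list[str], cost_per_1k_tokens: float = 0.01) -> dict:
    """Aggregate one day of JSON-line logs into per-agent totals."""
    per_agent: dict = defaultdict(lambda: {"tasks": 0, "errors": 0, "tokens": 0})
    for line in log_lines:
        e = json.loads(line)
        a = per_agent[e["agent"]]
        a["tasks"] += 1
        a["errors"] += e["status"] == "failure"
        a["tokens"] += e.get("tokens", 0)
    for a in per_agent.values():
        a["success_rate"] = 1 - a["errors"] / a["tasks"]
        # Placeholder flat rate; real cost attribution uses per-model pricing.
        a["cost"] = round(a["tokens"] / 1000 * cost_per_1k_tokens, 4)
    return dict(per_agent)

logs = [
    json.dumps({"agent": "planner", "status": "success", "tokens": 500}),
    json.dumps({"agent": "planner", "status": "failure", "tokens": 800}),
    json.dumps({"agent": "coder", "status": "success", "tokens": 1200}),
]
summary = daily_summary(logs)
```

Sorting the result by error rate or tokens-per-task gives the side-by-side agent comparison; writing one summary row per day gives the weekly trend data.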

Practical Exercise

Add observability to an existing agent workflow:

  • Implement structured JSON logging with the six required fields for every agent action
  • Add correlation IDs that trace a task through all agents in a pipeline
  • Create an audit trail that records every file change with before/after hashes
  • Track task duration and token usage, then generate a daily summary report
  • Set up an alert that fires when any agent's success rate drops below 90%

Run 20 tasks through the pipeline and review the logs. You should be able to trace any single task from start to finish using only its correlation ID.
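The alert step of the exercise reduces to a threshold check over per-agent success rates. A minimal sketch; in practice the result would feed a pager or chat webhook rather than a returned list:

```python
def check_alerts(per_agent_rates: dict[str, float],
                 threshold: float = 0.90) -> list[str]:
    """Return one alert message per agent whose success rate is below threshold."""
    return [
        f"ALERT: {agent} success rate {rate:.0%} is below {threshold:.0%}"
        for agent, rate in sorted(per_agent_rates.items())
        if rate < threshold
    ]

alerts = check_alerts({"planner": 0.97, "coder": 0.85})
# Only "coder" trips the 90% threshold.
```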

Full visibility into your agent fleet

The AI Brain Pro package includes a built-in observability dashboard with structured logging, audit trails, performance metrics, and cost tracking — zero configuration required.
