Why Tokens Cost Money
Every character Claude reads and writes consumes tokens. Input tokens cost money when Claude reads your files, and output tokens cost more when Claude generates responses. A single session that reads 20 large files and generates detailed analysis can consume millions of tokens — the equivalent of several dollars in API costs.
The goal is not to use fewer tokens blindly but to use them efficiently. Read only what you need, generate only what is useful, and route tasks to the cheapest model that can handle them. Token optimization is about getting the same quality output at a fraction of the cost.
Context Window Management
The context window is finite. Once it fills up, Claude loses the ability to reference earlier information. Proactive context management means being deliberate about what enters the window:
- Read selectively: Use line ranges instead of reading entire files. If you need lines 50-80, do not read all 500 lines
- Prune aggressively: Keep memory.md under 100 lines. Archive stale information to daily notes
- Front-load critical context: Put the most important information in CLAUDE.md where it is always visible, not buried in files that may be evicted
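The "read selectively" rule can be sketched as a small helper. The function name and signature here are illustrative, not part of any Claude tooling:

```python
def read_lines(path, start, end):
    """Return only lines start..end (1-indexed, inclusive) of a file,
    so the rest of the file never enters the context window."""
    with open(path) as f:
        return [line for i, line in enumerate(f, start=1) if start <= i <= end]
```

Feeding Claude the 30 lines it needs instead of a 500-line file keeps the other 470 lines out of the context window entirely.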
A well-managed context window means Claude has access to relevant information for the entire session instead of forgetting critical details halfway through.
Prompt Compression
Prompt compression reduces the token count of your instructions without losing meaning. The techniques are straightforward:
- Use tables instead of prose: A retrieval map written as a table can use roughly 60% fewer tokens than the same information written as paragraphs
- Eliminate redundancy: If a rule appears in both CLAUDE.md and memory.md, keep it in one place and reference it from the other
- Use shorthand for repeated patterns: Define abbreviations early ("SC = server component") and use them throughout
- Remove filler language: "Please make sure to always" becomes "Always." Imperative sentences are shorter and clearer
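You can sanity-check a compression pass with a rough token estimate. The ~4-characters-per-token ratio below is a common rule of thumb for English prose, not an exact count; use your provider's token-counting endpoint for real numbers:

```python
def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English prose.
    return max(1, len(text) // 4)

verbose = "Please make sure to always run the test suite before you commit any changes."
compressed = "Always run tests before committing."

# The imperative rewrite carries the same instruction in far fewer tokens.
savings = estimate_tokens(verbose) - estimate_tokens(compressed)
```

Run the estimate before and after rewriting a CLAUDE.md section to confirm the edit actually shrank it.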
Smart Model Routing
Not every task needs the most powerful model. Smart routing matches task complexity to model capability:
- Haiku (fast, cheap): File renaming, simple formatting, boilerplate generation, git operations
- Sonnet (balanced): Feature implementation, code review, refactoring, most daily coding work
- Opus (deep, expensive): Architecture decisions, complex debugging, security audits, multi-file refactors
A team that routes 70% of tasks to Sonnet, sends 20% to Haiku, and escalates only the hardest 10% to Opus will spend a fraction of what a team that uses Opus for everything spends — with nearly identical results.
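A routing table like the one above can live in code as well as in CLAUDE.md. The task categories and tier names below are illustrative; map them to current model IDs from the Anthropic docs:

```python
# Hypothetical task categories mapped to model tiers.
ROUTES = {
    "rename-files": "haiku",
    "format": "haiku",
    "git-ops": "haiku",
    "implement-feature": "sonnet",
    "code-review": "sonnet",
    "refactor": "sonnet",
    "architecture": "opus",
    "security-audit": "opus",
}

def pick_model(task_type):
    # Default to the balanced tier when a task type is unrecognized.
    return ROUTES.get(task_type, "sonnet")
```

Defaulting unknown tasks to the middle tier keeps the common case cheap while leaving escalation to Opus an explicit decision.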
Caching Strategies
Prompt caching lets you avoid re-sending the same context on every request. When your CLAUDE.md, type definitions, and memory files are cached, Claude only processes new content in each turn. This can reduce input token costs by 50-90% for long sessions.
Structure your files for maximum cache hits: keep stable content (rules, architecture decisions) in files that rarely change, and volatile content (current task, progress) in separate files that update frequently. The stable files stay cached while only the volatile files consume fresh tokens.
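The stable/volatile split maps directly onto how prompt caching is requested. A minimal sketch of assembling the system blocks, assuming the Anthropic Messages API prompt-caching format (a `cache_control` breakpoint on the last stable block); the file contents here are placeholders:

```python
def build_system_blocks(stable_text, volatile_text):
    """Stable content goes first with a cache breakpoint so it can be
    reused across requests; volatile content follows and is processed
    fresh each turn."""
    return [
        {
            "type": "text",
            "text": stable_text,  # rules, architecture decisions
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            "text": volatile_text,  # current task, progress notes
        },
    ]
```

Because the cache breakpoint sits after the stable block, editing your progress notes does not invalidate the cached rules.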
Practical Exercise
Audit and Optimize Your Token Usage
Measure your current usage and apply three optimizations:
- Measure baseline: Check your Anthropic dashboard for token usage over the past week. Note the input/output ratio
- Compress your CLAUDE.md: Rewrite your longest prose sections as tables or bullet lists. Target a 30% reduction in line count
- Add model routing rules: Create a section in CLAUDE.md that specifies which model tier to use for common task types
- Split stable vs volatile: Separate your rarely-changing rules from frequently-updating status into different files for better caching
Want automated token tracking? The AI Brain Pro package includes a PostToolUse hook that logs token consumption per session and alerts you when costs spike. View pricing →