Context Compaction
Long-running agent sessions accumulate messages until they exceed the model’s context window. When that happens, the LLM rejects the request. Context compaction automatically reduces the conversation size so the agent can keep working without losing important information.
Everruns provides multiple compaction strategies that can be combined. The default auto strategy cascades through all of them in order — from cheapest (free) to most expensive (LLM call) — stopping as soon as the context fits.
```
┌─────────────────────────────────────────────────────────┐
│                    Context Window                       │
│                                                         │
│  ┌─────────────┐  ┌──────────────┐  ┌───────────────┐   │
│  │ System      │  │ Conversation │  │ Recent        │   │
│  │ Prompt      │  │ Summary      │  │ Messages      │   │
│  │ (always     │  │ (cold tier   │  │ (hot tier     │   │
│  │ kept)       │  │ replaced)    │  │ verbatim)     │   │
│  └─────────────┘  └──────────────┘  └───────────────┘   │
│                                                         │
│  ◄──────── Compaction fills this budget ────────────►   │
└─────────────────────────────────────────────────────────┘
```

How It Works
Compaction operates at two points:
- Proactively — before each LLM call, Everruns estimates the token count. If it exceeds a configurable budget threshold (default 85% of the model’s context window), compaction runs before the call is made. This avoids the latency of a failed request.
- Reactively — if the LLM still returns a `RequestTooLarge` error (estimation can undercount), the compaction cascade runs and the request is retried automatically.
In both cases, the same cascade of strategies executes:
```
Step 1: Observation Masking (free, instant)
   └─ Replace old tool outputs with one-line summaries
         ↓ still over budget?
Step 2: Native Provider Compaction (if available)
   └─ Call provider's compact endpoint (e.g., OpenAI /responses/compact)
         ↓ still over budget?
Step 3: Summarization (LLM call)
   └─ Summarize older conversation turns into a structured summary
         ↓ still over budget?
Step 4: Aggressive Trim (last resort)
   └─ Drop oldest messages to fit within the token budget
```

The UI shows a divider between messages whenever compaction happens:
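The cascade can be sketched as a loop that tries progressively more expensive strategies until the estimated token count fits the budget. This is an illustration, not the Everruns implementation — `estimate_tokens` and the stand-in strategy functions below are hypothetical:

```python
from typing import Callable

def estimate_tokens(messages: list[str]) -> int:
    """Crude token estimate: ~4 characters per token (illustrative only)."""
    return sum(len(m) for m in messages) // 4

def run_cascade(
    messages: list[str],
    budget_tokens: int,
    strategies: list[Callable[[list[str]], list[str]]],
) -> tuple[list[str], list[str]]:
    """Apply strategies cheapest-first, stopping as soon as the context fits.

    Returns the compacted messages plus the names of strategies that ran.
    """
    ran = []
    for strategy in strategies:
        if estimate_tokens(messages) <= budget_tokens:
            break  # already fits — don't pay for more expensive steps
        messages = strategy(messages)
        ran.append(strategy.__name__)
    return messages, ran

def observation_masking(messages: list[str]) -> list[str]:
    # Stand-in: shorten every message that looks like a tool output.
    return [("TOOL: [masked]" if m.startswith("TOOL:") else m) for m in messages]

def aggressive_trim(messages: list[str]) -> list[str]:
    # Stand-in: drop the oldest half of the messages.
    return messages[len(messages) // 2:]

msgs = ["TOOL: " + "x" * 400] * 10 + ["user: what changed?"]
compacted, ran = run_cascade(msgs, budget_tokens=100,
                             strategies=[observation_masking, aggressive_trim])
```

Because masking alone brings this example under budget, the trim step never runs — mirroring the "stop as soon as the context fits" behavior described above.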
Context compacted · 142 → 38 messages · observation_masking+summarization
Click the divider to see the cascade details — which strategies ran, how many messages each step produced, and the time taken.
Strategies
Auto (default)
Runs all strategies in order. Stops as soon as context fits. This is the recommended setting for most use cases.
Observation Masking
Replaces old tool outputs with compact summaries while keeping the message structure intact. This is free (no LLM call) and preserves tool call IDs for tracing.
Two summary formats:
| Format | Example | When to use |
|---|---|---|
| `one_line` (default) | `[read_file → 47 lines, 2340 bytes]` | Most cases — minimal footprint |
| `head_tail` | First 3 lines + `... (14 lines omitted) ...` + last 3 lines | When partial output context helps |
The most recent N tool outputs are always kept verbatim (default: 5).
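The two formats might look roughly like this in practice. A sketch only — the helper names, the 3-line head/tail split, and the message shape are illustrative assumptions, not Everruns internals:

```python
def mask_output(text: str, fmt: str = "one_line",
                head: int = 3, tail: int = 3) -> str:
    """Summarize one tool output in one of two illustrative formats."""
    lines = text.splitlines()
    if fmt == "one_line":
        return f"[tool output → {len(lines)} lines, {len(text.encode())} bytes]"
    # head_tail: keep the first and last few lines, note what was omitted
    omitted = len(lines) - head - tail
    if omitted <= 0:
        return text
    return "\n".join(lines[:head]
                     + [f"... ({omitted} lines omitted) ..."]
                     + lines[-tail:])

def mask_old_outputs(outputs: list[str], keep_recent: int = 5,
                     fmt: str = "one_line") -> list[str]:
    """Mask all but the most recent `keep_recent` tool outputs."""
    cutoff = max(len(outputs) - keep_recent, 0)
    return [mask_output(o, fmt) if i < cutoff else o
            for i, o in enumerate(outputs)]

outputs = ["line\n" * 20 for _ in range(8)]
masked = mask_old_outputs(outputs, keep_recent=5)
```

With eight outputs and `keep_recent=5`, only the three oldest are replaced; the rest pass through verbatim.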
Native Provider Compaction
Delegates compaction to the LLM provider’s own endpoint. Currently supported by OpenAI’s Responses API (`/responses/compact`). When available, this can be more intelligent than generic strategies, since the provider understands its own tokenization.
Summarization
Uses an LLM to generate a structured summary of older messages. The summary replaces those messages in context and is wrapped in `[CONVERSATION_SUMMARY]` tags so subsequent compactions can re-summarize it.
You can configure:
- Which model to use (default: same as the agent)
- What information to preserve (decisions, files modified, errors, etc.)
- Custom instructions appended to the summarization prompt
Aggressive Trim
Last resort. Drops the oldest messages to fit within the token budget. The system prompt and the most recent messages are always preserved. This is lossy — dropped messages cannot be recovered unless Infinity Context is enabled.
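The trim rule can be pictured as dropping from the oldest end while pinning the system prompt and the newest messages — a sketch, with a crude 4-characters-per-token estimate standing in for the real tokenizer:

```python
def aggressive_trim(system_prompt: str, messages: list[str],
                    budget_tokens: int, min_recent: int = 2) -> list[str]:
    """Drop oldest messages until the estimate fits, never touching
    the system prompt or the last `min_recent` messages."""
    def est(parts: list[str]) -> int:
        return sum(len(p) for p in parts) // 4  # placeholder estimate

    kept = list(messages)
    while len(kept) > min_recent and est([system_prompt] + kept) > budget_tokens:
        kept.pop(0)  # lossy: gone unless Infinity Context still holds it
    return kept

msgs = ["m" * 100 for _ in range(10)]
kept = aggressive_trim("system", msgs, budget_tokens=80)
```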
Configuration
Compaction is a capability configured per agent or harness via `AgentCapabilityConfig`.
Default (auto strategy, proactive)
```json
{
  "capabilities": ["compaction"]
}
```

Custom strategy and budget
```json
{
  "capabilities": [
    {
      "ref": "compaction",
      "config": {
        "strategy": "auto",
        "proactive": true,
        "budget_percent": 0.85
      }
    }
  ]
}
```

Observation masking only (no LLM calls)
```json
{
  "capabilities": [
    {
      "ref": "compaction",
      "config": {
        "strategy": "observation_masking",
        "observation_masking": {
          "keep_recent_tool_outputs": 10,
          "summary_format": "head_tail"
        }
      }
    }
  ]
}
```

Summarization with a cheaper model
```json
{
  "capabilities": [
    {
      "ref": "compaction",
      "config": {
        "strategy": "summarization",
        "summarization": {
          "model": "claude-haiku-4-5-20251001",
          "preserve": ["decisions", "files_modified", "errors", "api_keys"],
          "instructions": "Focus on architecture decisions and API contract changes"
        }
      }
    }
  ]
}
```

Full configuration with memory tiers
```json
{
  "capabilities": [
    {
      "ref": "compaction",
      "config": {
        "strategy": "auto",
        "proactive": true,
        "budget_percent": 0.80,
        "observation_masking": {
          "keep_recent_tool_outputs": 5,
          "summary_format": "one_line"
        },
        "summarization": {
          "model": null,
          "preserve": ["decisions", "files_modified", "errors", "current_plan"],
          "instructions": null
        },
        "memory_tiers": {
          "hot_messages": 20,
          "warm_messages": 100
        }
      }
    }
  ]
}
```

Configuration Reference
Top-level
Section titled “Top-level”| Field | Type | Default | Description |
|---|---|---|---|
| `strategy` | string | `"auto"` | Compaction strategy: `auto`, `native`, `observation_masking`, or `summarization` |
| `proactive` | boolean | `true` | Compact before hitting context limits (recommended) |
| `budget_percent` | float | `0.85` | Trigger proactive compaction at this fraction of the context window |
Observation Masking
| Field | Type | Default | Description |
|---|---|---|---|
| `keep_recent_tool_outputs` | integer | `5` | Number of recent tool outputs to keep verbatim |
| `summary_format` | string | `"one_line"` | How to summarize masked outputs: `one_line` or `head_tail` |
Summarization
| Field | Type | Default | Description |
|---|---|---|---|
| `model` | string \| null | `null` | Model for summarization. `null` = same as the agent’s model |
| `preserve` | string[] | `["decisions", "files_modified", "errors", "current_plan"]` | Information categories to preserve in summaries |
| `instructions` | string \| null | `null` | Custom instructions appended to the summarization prompt |
Memory Tiers
| Field | Type | Default | Description |
|---|---|---|---|
| `hot_messages` | integer | `20` | Recent messages kept verbatim (full content) |
| `warm_messages` | integer | `100` | Older messages with observation masking applied to tool outputs |
Messages beyond hot + warm are in the cold tier — replaced with a conversation summary. If Infinity Context is enabled, cold-tier messages remain queryable via `query_history`.
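The tier boundaries can be computed by slicing from the newest end. A sketch of the partitioning rule, assuming messages are a plain oldest-to-newest list:

```python
def partition_tiers(messages: list[str], hot: int = 20,
                    warm: int = 100) -> tuple[list[str], list[str], list[str]]:
    """Split messages (oldest → newest) into cold, warm, and hot tiers.

    Hot:  newest `hot` messages, kept verbatim.
    Warm: the `warm` messages before that, tool outputs masked.
    Cold: everything older, replaced by a conversation summary.
    """
    warm_start = max(len(messages) - hot - warm, 0)
    hot_start = max(len(messages) - hot, 0)
    return (messages[:warm_start],          # cold tier
            messages[warm_start:hot_start],  # warm tier
            messages[hot_start:])            # hot tier

msgs = [f"m{i}" for i in range(150)]
cold, warm, hot = partition_tiers(msgs, hot=20, warm=100)
```

With 150 messages and the defaults, the oldest 30 land in the cold tier, the next 100 in warm, and the newest 20 in hot; shorter histories simply leave the older tiers empty.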
Memory Tier Diagram
```
              Messages (oldest → newest)

┌──────────────────┬───────────────────────┬───────────────┐
│    Cold Tier     │       Warm Tier       │   Hot Tier    │
│                  │                       │               │
│ Replaced with    │ Tool outputs masked   │ Full          │
│ [CONVERSATION_   │ with one-line         │ verbatim      │
│ SUMMARY]         │ summaries             │ content       │
│                  │                       │               │
│ Queryable via    │ Message structure     │ Always sent   │
│ query_history    │ preserved             │ to the LLM    │
│ (if Infinity     │                       │               │
│ Context on)      │                       │               │
└──────────────────┴───────────────────────┴───────────────┘
                     ◄── warm_messages ──►  ◄── hot_messages ──►
```

Combining with Infinity Context
Compaction and Infinity Context are complementary:
- Infinity Context limits how many messages are loaded from the database into the prompt, and provides `query_history` for retrieval.
- Compaction reduces the size of messages that are in the prompt — making tool outputs smaller, summarizing old turns, or trimming when nothing else works.
For long-running sessions, enable both:
{ "capabilities": [ "infinity_context", { "ref": "compaction", "config": { "strategy": "auto", "proactive": true } } ]}With both active, the flow is:
- Infinity Context limits messages loaded (e.g., last 100 messages)
- Compaction masks old tool outputs in those messages
- If still over budget, summarization or trim kicks in
- Cold-tier messages remain accessible via `query_history`
Events
Compaction emits two SSE events:
| Event | When | Key fields |
|---|---|---|
| `context.compacting` | Cascade starts | `reason` (`proactive_budget`, `request_too_large`, `manual`), `strategy`, `messages_before` |
| `context.compacted` | Cascade completes | `strategy_used`, `messages_before`, `messages_after`, `duration_ms`, `steps[]` |
Each step in the cascade is recorded with its strategy name, resulting message count, and duration.
Best Practices
- Start with defaults. The `auto` strategy with `proactive: true` handles most cases well.
- Lower `budget_percent` (e.g., `0.70`) if your agents use large tool outputs frequently — this gives more headroom before the context fills.
- Increase `keep_recent_tool_outputs` if your agent often references recent tool results across multiple turns.
- Use a cheaper model for summarization (e.g., Haiku) to reduce cost and latency when the summarization step runs.
- Enable Infinity Context alongside compaction for sessions that run for hours or days.
- Customize `preserve` to match your agent’s domain — if your agent tracks database schemas or API contracts, add those to the preserve list.
See Also
- Infinity Context — Message history windowing and retrieval
- Capabilities Overview — How capabilities are configured
- Harnesses — Where capability configs are applied
- Events — SSE event streaming reference