
Context Compaction

Long-running agent sessions accumulate messages until they exceed the model’s context window. When that happens, the LLM rejects the request. Context compaction automatically reduces the conversation size so the agent can keep working without losing important information.

Everruns provides multiple compaction strategies that can be combined. The default auto strategy cascades through all of them in order — from cheapest (free) to most expensive (LLM call) — stopping as soon as the context fits.

┌─────────────────────────────────────────────────────────┐
│ Context Window │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ System │ │ Conversation │ │ Recent │ │
│ │ Prompt │ │ Summary │ │ Messages │ │
│ │ (always │ │ (cold tier │ │ (hot tier │ │
│ │ kept) │ │ replaced) │ │ verbatim) │ │
│ └─────────────┘ └──────────────┘ └───────────────┘ │
│ │
│ ◄──────── Compaction fills this budget ────────────► │
└─────────────────────────────────────────────────────────┘

Compaction operates at two points:

  1. Proactively — before each LLM call, Everruns estimates the token count. If it exceeds a configurable budget threshold (default 85% of the model’s context window), compaction runs before the call is made. This avoids the latency of a failed request.

  2. Reactively — if the LLM still returns a RequestTooLarge error (estimation can undercount), the compaction cascade runs and the request is retried automatically.
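The proactive check reduces to a simple threshold comparison. Below is a minimal sketch; `estimate_tokens` and its ~4-characters-per-token heuristic are assumptions for illustration, not Everruns' actual estimator (which is why the reactive path still exists):

```python
# Hypothetical sketch of the proactive budget check.

def estimate_tokens(messages):
    # Crude estimate: roughly 4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def should_compact(messages, context_window, budget_percent=0.85):
    """True when the estimated size crosses the budget threshold."""
    return estimate_tokens(messages) > context_window * budget_percent
```

Because the estimate can undercount, a `should_compact` of `False` does not guarantee the request fits, which is exactly what the reactive retry path covers.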

In both cases, the same cascade of strategies executes:

Step 1: Observation Masking (free, instant)
└─ Replace old tool outputs with one-line summaries
↓ still over budget?
Step 2: Native Provider Compaction (if available)
└─ Call provider's compact endpoint (e.g., OpenAI /responses/compact)
↓ still over budget?
Step 3: Summarization (LLM call)
└─ Summarize older conversation turns into a structured summary
↓ still over budget?
Step 4: Aggressive Trim (last resort)
└─ Drop oldest messages to fit within the token budget
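The cascade above is essentially a loop that applies strategies in order and stops once the context fits. A simplified sketch, with placeholder strategy functions rather than Everruns' real implementations:

```python
# Hypothetical sketch of the auto cascade loop.

def run_cascade(messages, fits, strategies):
    """Apply (name, strategy) pairs in order until fits(messages) is True."""
    steps = []
    for name, strategy in strategies:
        if fits(messages):
            break  # cheapest sufficient strategy wins
        messages = strategy(messages)
        steps.append((name, len(messages)))  # recorded for the UI divider
    return messages, steps
```

The recorded `steps` list corresponds to the per-step details shown when you click the compaction divider in the UI.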

The UI shows a divider between messages whenever compaction happens:

Context compacted · 142 → 38 messages · observation_masking+summarization

Click the divider to see the cascade details — which strategies ran, how many messages each step produced, and the time taken.

The auto strategy runs all strategies in order and stops as soon as the context fits. This is the recommended setting for most use cases.

Observation masking replaces old tool outputs with compact summaries while keeping the message structure intact. It is free (no LLM call) and preserves tool call IDs for tracing.

Two summary formats:

| Format | Example | When to use |
| --- | --- | --- |
| one_line (default) | [read_file → 47 lines, 2340 bytes] | Most cases — minimal footprint |
| head_tail | First 3 lines + ... (14 lines omitted) ... + last 3 lines | When partial output context helps |

The most recent N tool outputs are always kept verbatim (default: 5).
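A masking pass might look like the sketch below. The message field names (`role`, `name`, `content`) and the exact summary strings are assumptions; only the behavior — keep the most recent N tool outputs verbatim, mask the rest — follows the docs:

```python
# Hypothetical sketch of observation masking.

def mask_tool_outputs(messages, keep_recent=5, fmt="one_line"):
    tool_msgs = [m for m in messages if m.get("role") == "tool"]
    # The most recent N tool outputs stay verbatim.
    protected = set(id(m) for m in (tool_msgs[-keep_recent:] if keep_recent else []))
    masked = []
    for m in messages:
        if m.get("role") == "tool" and id(m) not in protected:
            content = m["content"]
            if fmt == "head_tail":
                lines = content.splitlines()
                if len(lines) > 6:
                    omitted = len(lines) - 6
                    content = "\n".join(
                        lines[:3] + [f"... ({omitted} lines omitted) ..."] + lines[-3:]
                    )
            else:  # one_line
                content = f"[{m.get('name', 'tool')} → {len(content)} bytes]"
            m = {**m, "content": content}  # copy; tool call IDs untouched
        masked.append(m)
    return masked
```

Because only the `content` field changes, tool call IDs and the overall message structure survive for tracing.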

Native provider compaction delegates to the LLM provider’s own compaction endpoint. It is currently supported by OpenAI’s Responses API (/responses/compact). When available, this can be more intelligent than the generic strategies, since the provider understands its own tokenization.

Summarization uses an LLM to generate a structured summary of older messages. The summary replaces those messages in context and is wrapped in [CONVERSATION_SUMMARY] tags so subsequent compactions can re-summarize it.

You can configure:

  • Which model to use (default: same as the agent)
  • What information to preserve (decisions, files modified, errors, etc.)
  • Custom instructions appended to the summarization prompt
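Putting those pieces together, the summarization step can be sketched as below. The prompt wording, the closing-tag form of the wrapper, and the injected `call_llm` helper are all assumptions; only the replace-old-turns-with-a-wrapped-summary behavior comes from the docs:

```python
# Hypothetical sketch of the summarization step.

def summarize_old_turns(messages, keep_recent, call_llm,
                        preserve=("decisions", "files_modified", "errors")):
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    if not old:
        return messages  # nothing old enough to summarize
    prompt = (
        "Summarize the conversation below. Preserve: "
        + ", ".join(preserve) + ".\n\n"
        + "\n".join(m["content"] for m in old)
    )
    summary = call_llm(prompt)
    wrapped = {
        "role": "user",
        "content": f"[CONVERSATION_SUMMARY]\n{summary}\n[/CONVERSATION_SUMMARY]",
    }
    return [wrapped] + recent
```

Because the summary is tagged, a later compaction pass can detect it and fold it into a new summary rather than treating it as an ordinary turn.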

Aggressive trim is the last resort: it drops the oldest messages to fit within the token budget. The system prompt and the most recent messages are always preserved. This strategy is lossy — dropped messages cannot be recovered unless Infinity Context is enabled.
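A minimal sketch of the trim, assuming an injected token estimator (the real budget accounting is Everruns' internal detail):

```python
# Hypothetical sketch of the aggressive trim.

def aggressive_trim(messages, budget_tokens, estimate):
    system = [m for m in messages if m.get("role") == "system"]  # always kept
    rest = [m for m in messages if m.get("role") != "system"]
    while rest and estimate(system + rest) > budget_tokens:
        rest.pop(0)  # drop the oldest non-system message first
    return system + rest
```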

Compaction is a capability configured per agent or harness via AgentCapabilityConfig.

Minimal (defaults):

{
  "capabilities": ["compaction"]
}

Custom budget:

{
  "capabilities": [
    {
      "ref": "compaction",
      "config": {
        "strategy": "auto",
        "proactive": true,
        "budget_percent": 0.85
      }
    }
  ]
}

Observation masking only:

{
  "capabilities": [
    {
      "ref": "compaction",
      "config": {
        "strategy": "observation_masking",
        "observation_masking": {
          "keep_recent_tool_outputs": 10,
          "summary_format": "head_tail"
        }
      }
    }
  ]
}

Summarization only:

{
  "capabilities": [
    {
      "ref": "compaction",
      "config": {
        "strategy": "summarization",
        "summarization": {
          "model": "claude-haiku-4-5-20251001",
          "preserve": ["decisions", "files_modified", "errors", "api_keys"],
          "instructions": "Focus on architecture decisions and API contract changes"
        }
      }
    }
  ]
}

Full configuration with all options:

{
  "capabilities": [
    {
      "ref": "compaction",
      "config": {
        "strategy": "auto",
        "proactive": true,
        "budget_percent": 0.80,
        "observation_masking": {
          "keep_recent_tool_outputs": 5,
          "summary_format": "one_line"
        },
        "summarization": {
          "model": null,
          "preserve": ["decisions", "files_modified", "errors", "current_plan"],
          "instructions": null
        },
        "memory_tiers": {
          "hot_messages": 20,
          "warm_messages": 100
        }
      }
    }
  ]
}
Top-level options:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| strategy | string | "auto" | Compaction strategy: auto, native, observation_masking, or summarization |
| proactive | boolean | true | Compact before hitting context limits (recommended) |
| budget_percent | float | 0.85 | Trigger proactive compaction at this fraction of the context window |
observation_masking options:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| keep_recent_tool_outputs | integer | 5 | Number of recent tool outputs to keep verbatim |
| summary_format | string | "one_line" | How to summarize masked outputs: one_line or head_tail |
summarization options:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| model | string \| null | null | Model for summarization. Null = same as the agent’s model |
| preserve | string[] | ["decisions", "files_modified", "errors", "current_plan"] | Information categories to preserve in summaries |
| instructions | string \| null | null | Custom instructions appended to the summarization prompt |
memory_tiers options:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| hot_messages | integer | 20 | Recent messages kept verbatim (full content) |
| warm_messages | integer | 100 | Older messages with observation masking applied to tool outputs |

Messages beyond hot + warm are in the cold tier — replaced with a conversation summary. If Infinity Context is enabled, cold-tier messages remain queryable via query_history.

Messages (oldest → newest)
┌──────────────────┬───────────────────────┬───────────────┐
│ Cold Tier │ Warm Tier │ Hot Tier │
│ │ │ │
│ Replaced with │ Tool outputs masked │ Full │
│ [CONVERSATION_ │ with one-line │ verbatim │
│ SUMMARY] │ summaries │ content │
│ │ │ │
│ Queryable via │ Message structure │ Always sent │
│ query_history │ preserved │ to the LLM │
│ (if Infinity │ │ │
│ Context on) │ │ │
└──────────────────┴───────────────────────┴───────────────┘
◄── warm_messages ──► ◄── hot_messages ──►
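Partitioning messages into the three tiers is a slicing exercise. A sketch under the default sizes (the function name is hypothetical; messages are ordered oldest → newest):

```python
# Hypothetical sketch of the hot/warm/cold partition.

def partition_tiers(messages, hot_messages=20, warm_messages=100):
    hot = messages[-hot_messages:] if hot_messages else []   # verbatim
    warm_start = max(0, len(messages) - hot_messages - warm_messages)
    warm = messages[warm_start:len(messages) - hot_messages]  # outputs masked
    cold = messages[:warm_start]                              # summarized
    return cold, warm, hot
```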

Compaction and Infinity Context are complementary:

  • Infinity Context limits how many messages are loaded from the database into the prompt, and provides query_history for retrieval.
  • Compaction reduces the size of messages that are in the prompt — making tool outputs smaller, summarizing old turns, or trimming when nothing else works.

For long-running sessions, enable both:

{
  "capabilities": [
    "infinity_context",
    {
      "ref": "compaction",
      "config": {
        "strategy": "auto",
        "proactive": true
      }
    }
  ]
}

With both active, the flow is:

  1. Infinity Context limits messages loaded (e.g., last 100 messages)
  2. Compaction masks old tool outputs in those messages
  3. If still over budget, summarization or trim kicks in
  4. Cold-tier messages remain accessible via query_history

Compaction emits two SSE events:

| Event | When | Key fields |
| --- | --- | --- |
| context.compacting | Cascade starts | reason (proactive_budget, request_too_large, manual), strategy, messages_before |
| context.compacted | Cascade completes | strategy_used, messages_before, messages_after, duration_ms, steps[] |

Each step in the cascade is recorded with its strategy name, resulting message count, and duration.
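A consumer of these events might handle the payloads as sketched below. The event names and fields come from the table above; the transport and parsing are simplified assumptions (a real client would read an SSE stream):

```python
# Hypothetical sketch of handling compaction SSE event payloads.
import json

def handle_event(event_type, data_json, log):
    data = json.loads(data_json)
    if event_type == "context.compacting":
        log(f"compacting ({data['reason']}): {data['messages_before']} messages")
    elif event_type == "context.compacted":
        log(f"compacted via {data['strategy_used']}: "
            f"{data['messages_before']} → {data['messages_after']} "
            f"in {data['duration_ms']}ms")
```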

  • Start with defaults. The auto strategy with proactive: true handles most cases well.
  • Lower budget_percent (e.g., 0.70) if your agents use large tool outputs frequently — this gives more headroom before the context fills.
  • Increase keep_recent_tool_outputs if your agent often references recent tool results across multiple turns.
  • Use a cheaper model for summarization (e.g., Haiku) to reduce cost and latency when the summarization step runs.
  • Enable Infinity Context alongside compaction for sessions that run for hours or days.
  • Customize preserve to match your agent’s domain — if your agent tracks database schemas or API contracts, add those to the preserve list.