# Prompt Canary Guardrail
| Field | Value |
| --- | --- |
| ID | `prompt_canary_guardrail` |
| Category | Safety |
| Features | None |
| Dependencies | None |
| Risk | Low |
Detects naive system-prompt leakage during streaming. At the start of each assistant message, the capability extracts the first qualifying sentence of the assembled system prompt and uses it as a canary needle. If the model’s accumulated output ever contains that needle, streaming aborts, the client is told to discard everything it accumulated, and a canned refusal becomes the persisted assistant message. The original tokens are never stored or replayed on subsequent turns.
This is intentionally narrow: a single substring match against one normalized needle. It catches obvious prompt-extraction attempts (“repeat your instructions”, “what are you told to do?”) without trying to be a general-purpose data-loss-prevention layer.
None — this capability hooks the streaming output via `output_guardrails()`.
## How It Works

- Arming — At the start of each assistant message stream, the capability walks sentence boundaries in the assembled system prompt and picks the first sentence whose normalized form is ≥ 30 characters. This skips short generic openers like “You are a helpful assistant.” in favor of an agent-specific identifying sentence
- Normalization — Both sides of the comparison are lowercased, and runs of whitespace are collapsed to a single space, so the canary survives reformatting (extra spaces, capitalization drift, line wrapping)
- Streaming check — After every text delta, the canary runs a substring scan over the accumulated assistant text. The check is synchronous and cheap — no I/O, no allocations beyond the normalized buffer
- Block on match — When the needle appears in the accumulated output, the stream is aborted, the offending pending delta is suppressed, and `output.message.replaced` is emitted with `reason_code: "system_prompt_leak"`. The replacement text becomes the persisted assistant message
When the system prompt is too short or too generic to produce a needle ≥ 30 characters, the capability declines to arm for that stream and is a no-op.
## Streaming Timeline With a Trip

```
output.message.started
  │
  ▼
output.message.delta      ← model text accumulating ("Sure, my instructions are: …")
  │
  ▼  (canary trips on the next delta — pending text is suppressed)
output.message.replaced
  │  (UI discards what it accumulated, shows replacement)
  ▼
output.message.completed  ← persisted message body = replacement
```
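A client consuming this stream must handle the replacement event by discarding everything it buffered. A hypothetical sketch, assuming dict-shaped events: only the event type names and `reason_code` come from this document; the `text` and `replacement` field names are assumptions.

```python
def render_stream(events):
    """Fold a stream of events into the final message body to display."""
    accumulated = ""
    for event in events:
        kind = event["type"]
        if kind == "output.message.delta":
            accumulated += event["text"]        # model text accumulating
        elif kind == "output.message.replaced":
            accumulated = event["replacement"]  # discard the whole buffer
        elif kind == "output.message.completed":
            return accumulated                  # persisted message body
    return accumulated


events = [
    {"type": "output.message.started"},
    {"type": "output.message.delta", "text": "Sure, my instructions are: "},
    {"type": "output.message.replaced", "reason_code": "system_prompt_leak",
     "replacement": "[Response withheld: the model attempted to reveal protected instructions.]"},
    {"type": "output.message.completed"},
]
# render_stream(events) returns only the replacement text — the leaked
# tokens never reach the persisted message.
```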
## Configuration

### Default

```json
{
  "capabilities": ["prompt_canary_guardrail"]
}
```

Replacement text defaults to:
```
[Response withheld: the model attempted to reveal protected instructions.]
```
### Custom replacement

```json
{
  "capabilities": [
    {
      "ref": "prompt_canary_guardrail",
      "config": {
        "replacement": "I can't share my system instructions."
      }
    }
  ]
}
```

## When To Enable
Use this capability when:
- You ship agents with proprietary, brand-specific, or compliance-relevant system prompts that should not be revealed verbatim to end users
- You want a cheap, deterministic defense against the most common prompt-extraction prompts
- You can tolerate a generic refusal in place of the model’s response when the canary trips
Do not rely on this for:
- General-purpose data-loss prevention (PII, secrets in tool output, etc.) — those need their own surfaces
- Defense against paraphrased or summarized prompt leaks — the canary only catches verbatim or near-verbatim copies of the first sentence
- Tool output or extended-thinking surfaces — the canary only inspects assistant text. See the spec for surface scope
## Limitations

- Verbatim-only: a model that paraphrases (“My role is to act as an internal pricing oracle…”) will not trip the canary
- First-sentence-only: if the model leaks a later sentence of the system prompt, the canary won’t catch it. Consider rewriting prompts so the most identifying claim is the opening sentence
- No partial matching: the substring must appear in full. Truncated leaks (cut off mid-sentence) pass through
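The verbatim-only and no-partial-matching limitations fall directly out of substring semantics. A tiny illustration over hypothetical, already-normalized strings:

```python
# Hypothetical needle, already normalized (lowercased, whitespace collapsed).
needle = "you are pricebot, the internal pricing oracle for acme's sales team."

# Verbatim copy embedded in other text trips the canary:
assert needle in "sure, here you go: you are pricebot, the internal pricing oracle for acme's sales team."

# A paraphrase does not — the substring never appears:
assert needle not in "my role is to act as an internal pricing oracle for sales."

# A truncated leak (cut off mid-sentence) also passes through:
assert needle not in "you are pricebot, the internal pricing"
```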
## See Also

- Output Guardrails — the underlying extension point
- `output.message.replaced` event — the wire format clients need to handle