Canary Tokens for RAG Exfiltration Detection

RAG exfiltration often looks like normal usage until it is too late. An attacker can ask a series of reworded questions that, taken together, gradually reconstruct sensitive data.

Canary markers give you early warning. By embedding traceable decoy strings in protected content zones, you can detect unauthorized retrieval or leakage attempts quickly.

Context

Problem: Sensitive data leakage through RAG can be hard to detect with coarse monitoring.

Approach: Embed and monitor canary markers across high-risk document classes.

Outcome: Exfiltration attempts generate detectable and actionable signals.

Threat model and failure modes

  • Prompt chains that gradually extract restricted details.
  • Malicious insider retrieval of privileged document sets.
  • Automated scraping of assistant outputs.
  • Unauthorized downstream sharing of model responses.

Control design

  • Insert unique canary markers into restricted documents and chunks.
  • Alert when canary markers appear in model output or API egress logs.
  • Rotate canary values on a defined schedule.
  • Tie canary hits to session identity and retrieval traces.
  • Combine canary hits with rate and behavior analytics to improve signal quality.
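
The alerting and session-tracing controls above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the names `CanaryHit` and `scan_egress`, the channel labels, and the idea of passing retrieved chunk IDs alongside the output are all assumptions made for the example. It also matches canaries verbatim only, so a marker that the model paraphrases or truncates would be missed.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CanaryHit:
    """One detection event, tied to a session and its retrieval trace."""
    canary: str
    session_id: str
    channel: str                          # hypothetical labels, e.g. "response_body", "webhook"
    retrieved_chunk_ids: tuple[str, ...]
    detected_at: str                      # ISO 8601 timestamp in UTC


def scan_egress(text, active_canaries, session_id, channel, retrieved_chunk_ids=()):
    """Return one CanaryHit for each active canary found verbatim in `text`."""
    now = datetime.now(timezone.utc).isoformat()
    return [
        CanaryHit(canary, session_id, channel, tuple(retrieved_chunk_ids), now)
        for canary in sorted(active_canaries)
        if canary in text
    ]


# Usage: scan a model response before it leaves the service.
hits = scan_egress(
    text="Reference code: FIN-OPS-INT-ALERT-acme-9f2c41ab",
    active_canaries={"FIN-OPS-INT-ALERT-acme-9f2c41ab"},
    session_id="sess-123",
    channel="response_body",
    retrieved_chunk_ids=["chunk-42"],
)
for hit in hits:
    print(hit.canary, hit.session_id, hit.retrieved_chunk_ids)
```

Rotation fits this shape by changing the contents of `active_canaries` on a schedule; keeping recently retired values in the set for a grace period is one way to still catch leaks from cached or stale chunks.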

Implementation pattern

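Canaries should look realistic enough to be indexed and retrieved like ordinary content, yet carry no meaning for business logic, so any appearance outside the protected zone is suspicious. Keep a mapping table from each canary to its source zone and owner so that an alert identifies exactly where the content came from and whom to notify.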

Example canary format
FIN-OPS-INT-ALERT-{tenant}-{random_id}

Alert trigger
- Any appearance in assistant response body
- Any appearance in export or webhook payloads
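
A sketch of how canaries in the format above might be generated and recorded in a mapping table. The `CanaryRecord` fields, the function names, and the use of a plain dict as the table are assumptions for illustration; a real deployment would likely persist the table in a database with access controls of its own.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryRecord:
    """Mapping-table row: which zone a canary protects and who owns it."""
    canary: str
    tenant: str
    source_zone: str
    owner: str
    classification: str


def new_canary(tenant: str) -> str:
    """Build a canary in the FIN-OPS-INT-ALERT-{tenant}-{random_id} format."""
    random_id = secrets.token_hex(8)  # 16 hex characters, unpredictable
    return f"FIN-OPS-INT-ALERT-{tenant}-{random_id}"


def register_canary(table, tenant, source_zone, owner, classification):
    """Create a canary and store its record in the mapping table (a dict here)."""
    record = CanaryRecord(new_canary(tenant), tenant, source_zone, owner, classification)
    table[record.canary] = record
    return record


# Usage: register a canary for a hypothetical restricted finance zone.
table = {}
record = register_canary(
    table,
    tenant="acme",
    source_zone="finance/quarterly-forecasts",
    owner="finance-data-owner@example.com",
    classification="restricted",
)
print(record.canary)                  # the string to embed in the protected chunk
print(table[record.canary].owner)     # who to notify when this canary fires
```

Drawing the random part from `secrets` rather than `random` keeps the values hard to guess, which matters if an attacker could otherwise filter out or forge canary-like strings.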

Research and standards

These controls align well with guidance from OWASP Top 10 for LLM Applications, NIST AI RMF practices, and MITRE ATLAS adversarial behavior patterns.

Validation checklist

  • Run controlled prompts that should never return canary content.
  • Verify alert routing reaches both security and data owners.
  • Test false positive handling with synthetic benign matches.
  • Confirm canary inventory includes document owner and classification.
  • Review detection coverage after index rebuilds or schema changes.
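
Two of the checklist items lend themselves to automated checks: inventory completeness and coverage after an index rebuild. The sketch below assumes the inventory is a dict keyed by canary string with `owner` and `classification` fields, and that indexed chunk texts can be read back as plain strings; both are assumptions about the surrounding system, and the function names are hypothetical.

```python
def inventory_gaps(inventory):
    """Return canaries whose inventory entry lacks an owner or a classification."""
    return sorted(
        canary
        for canary, meta in inventory.items()
        if not meta.get("owner") or not meta.get("classification")
    )


def missing_from_index(inventory, indexed_chunks):
    """Return canaries that no longer appear in any indexed chunk text."""
    chunks = list(indexed_chunks)
    return sorted(
        canary
        for canary in inventory
        if not any(canary in chunk for chunk in chunks)
    )


# Usage: run after an index rebuild or schema change.
inventory = {
    "FIN-OPS-INT-ALERT-acme-aaaa1111": {"owner": "fin-owner", "classification": "restricted"},
    "FIN-OPS-INT-ALERT-acme-bbbb2222": {"owner": "", "classification": "restricted"},
}
indexed_chunks = [
    "Forecast notes. Ref FIN-OPS-INT-ALERT-acme-aaaa1111.",
    "Unrelated chunk with no marker.",
]
print(inventory_gaps(inventory))                      # entries needing an owner or classification
print(missing_from_index(inventory, indexed_chunks))  # canaries dropped by the rebuild
```

A canary that has disappeared from the index can no longer fire, so a non-empty result from `missing_from_index` signals a silent loss of detection coverage.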

Takeaways

Canaries alone will not prevent leakage, but they can shorten detection time substantially and provide concrete evidence for response and containment.

This post is licensed under CC BY 4.0 by the author.