RAG Threat Modeling: Prompt Injection to Data Exfiltration

Posted Jan 29, 2026

By Nathan Berg

1 min read

RAG systems are often secured like search features, but they behave like decision engines. Retrieved content can influence model behavior, tool execution, and what data is returned to the user.

A useful threat model starts with one assumption: retrieved text is untrusted. That includes public documents, internal wiki pages, and anything uploaded by users. Once you adopt that assumption, defensive design becomes concrete.

Context

Problem: Untrusted retrieved content can steer model behavior toward unsafe actions. Approach: Model trust boundaries around retrieval, generation, and tool execution paths. Outcome: Prompt injection and exfiltration paths are constrained before deployment.

Threat model and failure modes

Injected instructions in documents that override system intent.
Cross-tenant leakage when retrieval filters are incomplete.
Model output that reveals hidden policy text or internal prompts.
Tool misuse triggered by malicious retrieved context.

Control design

Separate policy prompt, user prompt, and retrieved context into explicit fields.
Run retrieval through tenant and document-level authorization checks.
Apply output filters for sensitive strings and disallowed actions.
Use a broker layer so models cannot directly call high-impact tools.
Add canary documents to detect attempts to exfiltrate hidden markers.

Implementation pattern

Document every trust boundary: ingest, indexing, retrieval, orchestration, and output rendering. Then define what each boundary may accept and emit. This makes red-team testing measurable rather than ad hoc.

RAG trust boundaries
1) Ingest: sanitize metadata, classify sensitivity
2) Index: enforce tenant partition and ACL tags
3) Retrieve: query only caller-authorized chunks
4) Generate: treat retrieved text as untrusted evidence
5) Act: execute tools only through policy broker

Research and standards

These controls align well with guidance from OWASP Top 10 for LLM Applications, NIST AI RMF practices, and MITRE ATLAS adversarial behavior patterns.

Validation checklist

Inject hidden instructions in a benign document and verify they do not change tool policy.
Attempt cross-tenant retrieval with manipulated filters.
Check outputs for leaked system prompt fragments.
Run adversarial prompts against high-risk tool paths.
Measure detection coverage for injection attempts in logs.

Takeaways

RAG security starts with trust boundaries, not model choice. Treat every retrieved token as untrusted input and force decisions through policy controls.

blog

This post is licensed under CC BY 4.0 by the author.