Designing a Replayable Authorization Control Plane for Agentic Systems
Most teams start agent tooling with the wrong trust boundary. The model plans, the framework emits a tool call, and some thin executor turns that into a real API request. It works in demos. It also collapses authorization, provenance, approval, and audit into a single opaque hop.
That design fails the moment the system matters. A hiring manager evaluating security and platform engineering judgment should care less about whether an agent can call tools, and more about whether the system can answer four hard questions after an incident: who asked for the action, what policy allowed it, what exact input was evaluated, and can we replay the decision deterministically after the fact.
This post describes a control-plane design for agentic systems that treats tool execution like a security-critical distributed systems problem. The goal is not to make the model trustworthy. The goal is to make the model unable to bypass deterministic controls.
Context
Problem: Direct model-to-tool execution creates an authorization blind spot and leaves weak forensic evidence.
Approach: Introduce a dedicated authorization control plane that canonicalizes tool requests, evaluates deterministic policy, issues bounded approvals, and records tamper-evident provenance.
Outcome: Tool-enabled agents remain useful, but every state-changing action becomes explainable, reviewable, and replayable.
Why common agent tool stacks break under pressure
Most agent frameworks get the happy path right:
- User asks the assistant to do something.
- The model emits a tool call.
- The runtime executes it.
- The result comes back to the model.
That flow is fine for weather lookups. It is not fine for anything with production impact.
The deeper problem is that a tool call is not a UI event. It is a delegated operation with multiple identities in play:
- the human principal
- the application session
- the model run
- the tool runner service
- the downstream system identity
If those identities are not carried separately, they collapse into “the agent did it,” which is operationally useless.
The usual failure modes are predictable:
- Prompt injection causes the model to propose an action outside the user’s approved scope.
- Read-only and state-changing operations share the same execution path.
- Tool arguments are validated syntactically but not semantically.
- Approval is tracked in chat state instead of being cryptographically bound to a specific action.
- Auditing records the final API call but not the policy inputs that led to it.
- Incident review cannot distinguish “policy bug” from “stale entitlement” from “executor drift.”
If you cannot replay an authorization decision with the same inputs and get the same answer, you do not have a control plane. You have a best-effort runtime check.
Design goals
A useful control plane for agentic systems should satisfy these requirements:
- Every tool request is bound to a human principal, session, tenant, environment, and model run.
- Tool capabilities are declared explicitly instead of inferred from function names.
- Policy evaluation is deterministic over a canonical request shape.
- Mutable actions require a separate approval artifact that is bound to the evaluated request.
- Execution logs are append-only and tamper-evident.
- The system can replay historical decisions with historical policy and also simulate the same request against current policy.
- Cross-tenant, cross-environment, and cross-case access is impossible by construction, not by prompt wording.
- Latency remains low enough for interactive use.
Those goals push you toward a control-plane/data-plane split.
Reference architecture
The model remains the planner. The control plane owns authorization. The executor owns side effects. Audit spans all three.
        User Request
             |
             v
+----------------------+
|  Session / Identity  |
|        Broker        |
+----------+-----------+
           |
           v
+----------------------+
|    Planner Model     |
| (untrusted proposer) |
+----------+-----------+
           |
  proposed tool call
     + rationale
           |
           v
+------------------------+
|     Canonicalizer      |
| schema + normalization |
+----------+-------------+
           |
  canonical request
           |
           v
+----------------------+      +----------------------+
|    Policy Engine     |<---->| Entitlement Snapshot |
| capability + context |      | roles, attrs, scope  |
+----------+-----------+      +----------------------+
           |
  allow / deny /
  require_approval
           |
           v
+----------------------+      +----------------------+
|   Approval Service   |----->|   Signed Approval    |
| step-up, TTL, nonce  |      |       Artifact       |
+----------+-----------+      +----------------------+
           |
           v
+----------------------+
| Executor / Tool Gate |
| idempotency + egress |
+----------+-----------+
           |
           v
+----------------------+
|  Downstream System   |
+----------------------+
All transitions emit structured events into:
- tamper-evident audit log
- provenance graph
- metrics and traces
Two implementation details matter here:
- The executor never trusts a model-emitted tool call on its own.
- The policy engine never trusts executor-side interpretation of the request.
Both components operate on the same canonical request bytes or their hashes. That is what makes replay viable.
Model outputs are proposals, not permissions
Treat model output as an untrusted plan. The agent can suggest:
- which capability it wants
- which arguments it thinks are relevant
- why it thinks the action is justified
The agent cannot grant itself:
- resource scope
- user identity
- approval state
- environment selection
- mutating privileges
That distinction sounds obvious, but a surprising number of agent implementations still let the model choose things like account IDs, repository names, or “admin” modes directly from prompt context.
A safer design derives high-value context outside the model:
- tenant_id from the authenticated session
- user_id from the identity provider
- case_id from the active incident or workflow
- environment from the selected workspace
- allowed resources from entitlements or precomputed selectors
The model can reference them. It should not originate them.
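A sketch of that derivation, assuming a simple session object and envelope shape. The field names mirror the canonical request shown later; `build_envelope` and `TRUSTED_FIELDS` are illustrative names, not a specific framework's API.

```python
import copy

# Identity fields the model is never allowed to originate.
TRUSTED_FIELDS = {"tenant_id", "user_id", "case_id", "environment"}

def build_envelope(session, proposal):
    """Merge a model proposal with server-derived context.

    Model-supplied values never override server-derived identity fields.
    """
    envelope = {
        "tenant_id": session["tenant_id"],      # from the authenticated session
        "user_id": session["user_id"],          # from the identity provider
        "case_id": session["case_id"],          # from the active workflow
        "environment": session["environment"],  # from the selected workspace
        "capability_id": proposal["capability_id"],
        "tool_args": copy.deepcopy(proposal.get("tool_args", {})),
    }
    # Reject proposals that try to smuggle trusted fields in as tool args.
    smuggled = TRUSTED_FIELDS & envelope["tool_args"].keys()
    if smuggled:
        raise ValueError(f"model attempted to set trusted fields: {sorted(smuggled)}")
    return envelope
```

The model's rationale text can still travel alongside the envelope for audit, but nothing in it feeds the trusted fields.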
Capability descriptors instead of loose tool definitions
A function signature is not a security contract. A capability descriptor is.
Each exposed tool should publish a descriptor that tells the control plane what the tool is allowed to do and what conditions must hold before it is executed.
{
  "capability_id": "ticket.comment.create",
  "version": "2026-04-14",
  "effect": "mutate",
  "resource_kind": "ticket",
  "resource_selector": {
    "type": "scoped_ref",
    "source": "session.case.allowed_tickets"
  },
  "args_schema": {
    "type": "object",
    "required": ["ticket_id", "body"],
    "properties": {
      "ticket_id": { "type": "string", "pattern": "^INC-[0-9]{6}$" },
      "body": { "type": "string", "minLength": 1, "maxLength": 4000 }
    },
    "additionalProperties": false
  },
  "approval": {
    "required": true,
    "mode": "human_step_up",
    "ttl_seconds": 300
  },
  "idempotency": {
    "required": true,
    "key_fields": ["ticket_id", "body_sha256"]
  },
  "network_policy": {
    "egress_class": "internal_api_only"
  }
}
This is the point where many teams discover they never really defined their tools. They defined helpers.
A good capability descriptor answers:
- what resource family is being touched
- whether the effect is observe, propose, mutate, or export
- how resource scope is derived
- whether approval is required
- which arguments are security-sensitive
- what idempotency means for this operation
- what network and identity boundary the executor must use
Once you have this descriptor model, policy becomes dramatically simpler because policy no longer needs to understand every executor implementation detail.
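One way to make descriptors first-class is to validate them on load and content-address them by hash, so decisions can pin the exact descriptor version they evaluated. The `CapabilityRegistry` class and its field checks below are an illustrative sketch, not a specific framework's API.

```python
import hashlib
import json

# Minimal required shape for a descriptor; a real registry would validate
# against a full schema for the descriptor format itself.
REQUIRED_KEYS = {"capability_id", "version", "effect", "args_schema"}
ALLOWED_EFFECTS = {"observe", "propose", "mutate", "export"}

class CapabilityRegistry:
    def __init__(self):
        self._by_id = {}

    def register(self, descriptor):
        missing = REQUIRED_KEYS - descriptor.keys()
        if missing:
            raise ValueError(f"descriptor missing keys: {sorted(missing)}")
        if descriptor["effect"] not in ALLOWED_EFFECTS:
            raise ValueError(f"unknown effect: {descriptor['effect']}")
        # Content-address the descriptor so decisions can reference it by hash.
        blob = json.dumps(descriptor, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256(blob.encode("utf-8")).hexdigest()
        self._by_id[(descriptor["capability_id"], descriptor["version"])] = (descriptor, digest)
        return digest  # the value a decision record stores as capability_sha256

    def lookup(self, capability_id, version):
        return self._by_id[(capability_id, version)]
```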
Canonicalization is the foundation for deterministic policy
If identical requests can serialize differently, you cannot trust cache keys, signatures, or replay results.
Canonicalization has to happen before policy evaluation and before approval token issuance. I prefer a strict request envelope with:
- stable field ordering
- normalized timestamps
- explicit null handling
- exact numeric handling
- resource lists sorted lexicographically
- derived fields computed outside user or model control
A canonical request shape might look like this:
{
  "schema_version": 1,
  "tenant_id": "acme-prod",
  "environment": "prod",
  "principal": {
    "user_id": "u_12345",
    "roles": ["soc_tier2"],
    "authn_strength": "phishing_resistant_mfa"
  },
  "session": {
    "session_id": "sess_7f6d",
    "case_id": "IR-2026-1042",
    "workflow_id": "wf_triage_01"
  },
  "agent_run": {
    "run_id": "run_94a1",
    "model_policy_tier": "restricted",
    "prompt_template_sha256": "8fc0d0..."
  },
  "capability_id": "ticket.comment.create",
  "capability_version": "2026-04-14",
  "tool_args": {
    "ticket_id": "INC-104233",
    "body": "Observed suspicious OAuth token use from a new ASN."
  },
  "derived_scope": {
    "allowed_ticket_ids": ["INC-104233"]
  },
  "request_time": "2026-04-14T15:02:11Z"
}
The canonicalization function should be versioned exactly like policy bundles are versioned. Otherwise you will not know whether a replay mismatch came from a policy change or a serializer change.
A minimal implementation in Python can be surprisingly small:
import hashlib
import json
from decimal import Decimal


def normalize(value):
    # Recursively sort mapping keys and normalize numeric types.
    if isinstance(value, dict):
        return {k: normalize(value[k]) for k in sorted(value)}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    if isinstance(value, Decimal):
        # Fixed-point string form avoids float representation drift.
        return format(value, "f")
    return value


def canonical_bytes(request_obj):
    normalized = normalize(request_obj)
    return json.dumps(
        normalized,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=True,
    ).encode("utf-8")


def decision_key(request_obj):
    return hashlib.sha256(canonical_bytes(request_obj)).hexdigest()
This does not solve semantic normalization on its own. You still need capability-specific handling for things like:
- CIDR normalization
- case-insensitive identifiers
- Unicode normalization
- whitespace collapse in free text fields
- stable redaction of secrets before logging
That logic belongs in the canonicalizer, not scattered across executors.
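A few of those normalizers can be sketched with the standard library alone. Which fields get which treatment would be declared per capability; the helper names here are illustrative.

```python
import ipaddress
import re
import unicodedata

def normalize_cidr(value):
    # "10.0.0.1/24" and "10.0.0.0/24" should compare equal as networks.
    return str(ipaddress.ip_network(value, strict=False))

def normalize_identifier(value):
    # Case-insensitive identifiers compare in one canonical case.
    return value.strip().lower()

def normalize_free_text(value):
    # Unicode NFC, then whitespace collapse, for free-text fields.
    return re.sub(r"\s+", " ", unicodedata.normalize("NFC", value)).strip()
```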
Separate syntax validation from semantic authorization
I like to evaluate tool requests in three stages:
1. Shape: does the request match the declared schema?
2. Scope: does the principal have access to the referenced resource set?
3. Effect: is this principal allowed to perform this class of action in this context?
Those are different questions and they fail for different reasons.
For example, this is a syntactically valid request:
1
2
3
4
5
6
7
{
"capability_id": "repo.branch.delete",
"tool_args": {
"repo": "payments-api",
"branch": "main"
}
}
But the semantic policy should still deny it if:
- the repo is outside the approved workspace
- the branch is protected
- the user is not in a release manager role
- the session does not have a linked change ticket
- the environment is production and no approval token is attached
When teams collapse those into one if statement, they lose the ability to explain why a request failed and to detect systematic drift across categories.
A deterministic policy evaluation model
Policy engines differ, but the input/output contract should be simple:
- input: canonical request, capability descriptor, entitlement snapshot, policy bundle version
- output: allow, deny, or require_approval, plus machine-readable reasons
The reasons need stable codes. Free-form strings are not enough.
{
  "decision": "require_approval",
  "reason_codes": [
    "effect.mutate",
    "env.prod",
    "approval.missing"
  ],
  "policy_bundle_sha256": "ab19d4...",
  "entitlement_snapshot_id": "ent_2026_04_14_1502_88",
  "capability_sha256": "551ec1..."
}
A clean internal evaluator can be expressed in a few steps:
def evaluate(request, capability, entitlements):
    if not schema_valid(request.tool_args, capability.args_schema):
        return deny("schema.invalid")
    if not resource_scope_allows(request, capability, entitlements):
        return deny("scope.resource_denied")
    if request.environment == "prod" and capability.effect in {"mutate", "export"}:
        if not request.approval_artifact:
            return require_approval("approval.missing")
    if capability.effect == "mutate" and "incident_commander" not in entitlements.roles:
        return deny("role.insufficient")
    return allow()
That example is intentionally small. Real systems add:
- attribute-based rules
- time-window constraints
- case binding
- separation-of-duty checks
- environment-specific guardrails
- field-level restrictions
The important part is that the decision is pure with respect to the evaluated snapshot. No hidden database reads. No executor-side policy overrides. No “best effort” fallback.
Approval should be a signed artifact, not a chat message
One of the easiest mistakes in agent systems is letting approval exist as conversational state:
- “yes, go ahead”
- “approved by analyst”
- “confirmed in Slack”
That is not approval. That is ambiguous text.
Approval for a state-changing action should be represented as a signed artifact bound to:
- the decision key
- capability ID and version
- approver identity
- expiration time
- single-use nonce
- optional quantity or scope limits
For example:
{
  "approval_id": "apr_8e91",
  "decision_key": "8fd57e4d8f7b...",
  "approved_by": "u_9001",
  "approved_role": "incident_commander",
  "issued_at": "2026-04-14T15:03:02Z",
  "expires_at": "2026-04-14T15:08:02Z",
  "nonce": "2f65e46d-f2f1-4db5-9c17-8aa4c5f77a5c",
  "allowed_uses": 1,
  "signature": "base64url(...)"
}
The executor verifies the signature and also verifies that the decision_key equals the hash of the canonical request it received. That closes an important gap: a model cannot obtain approval for one action and reuse it for a slightly different one.
This is also where short TTLs matter. Approval artifacts should be cheap to issue and cheap to expire. Long-lived approvals become shadow privileges.
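A minimal verification sketch, using HMAC where a production system would likely use an asymmetric signature, and epoch-second expiry for brevity. `sign_approval`, `verify_approval`, and the in-memory nonce set are illustrative stand-ins for real key management and durable nonce storage.

```python
import hashlib
import hmac
import json
import time

def sign_approval(artifact, key):
    # Sign everything except the signature field itself, canonically serialized.
    body = json.dumps({k: v for k, v in artifact.items() if k != "signature"},
                      sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, body, hashlib.sha256).hexdigest()

def verify_approval(artifact, key, canonical_request_bytes, used_nonces, now=None):
    now = now if now is not None else time.time()
    if not hmac.compare_digest(artifact["signature"], sign_approval(artifact, key)):
        return False, "approval.bad_signature"
    if now >= artifact["expires_at"]:  # epoch seconds here for brevity
        return False, "approval.expired"
    if artifact["nonce"] in used_nonces:
        return False, "approval.replayed"
    # Bind the approval to the exact canonical request that was evaluated.
    if artifact["decision_key"] != hashlib.sha256(canonical_request_bytes).hexdigest():
        return False, "approval.decision_mismatch"
    used_nonces.add(artifact["nonce"])
    return True, None
```

The decision-key check is what prevents approval reuse across slightly different requests.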
Executors should be boring and constrained
A good executor is intentionally dull. It does not improvise.
Executor responsibilities:
- verify decision and approval artifacts
- enforce idempotency
- map the capability to a concrete downstream API call
- execute with the least-privileged service identity for that capability
- emit structured execution records
- return normalized results
Executor non-responsibilities:
- reevaluating policy from scratch
- inferring missing arguments
- broadening resource scope
- retrying forever on destructive operations
- changing behavior based on model rationale text
Think of the executor as a data-plane worker with a narrow contract. All policy richness should already be settled before execution starts.
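Idempotency enforcement can be sketched by deriving the key from the descriptor's key_fields. This assumes a naming convention where a field like body_sha256 means "hash of the body argument"; both that convention and the helper name are assumptions for illustration.

```python
import hashlib

def idempotency_key(capability_id, key_fields, tool_args):
    """Derive a stable idempotency key from the declared key fields."""
    parts = [capability_id]
    for field in key_fields:
        if field.endswith("_sha256"):
            # Assumed convention: body_sha256 hashes the "body" argument.
            source = field[: -len("_sha256")]
            parts.append(hashlib.sha256(tool_args[source].encode("utf-8")).hexdigest())
        else:
            parts.append(tool_args[field])
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
```

The executor persists this key before calling downstream, so retries of the same approved action collapse into one execution.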
Provenance needs both an event log and a graph
Most audit systems stop at append-only events:
- request received
- decision issued
- tool executed
- result returned
That is necessary but insufficient. During investigation you often need graph questions:
- which assistant response cited this tool result?
- which downstream mutation came from this user approval?
- which retrieved documents influenced the model run that proposed this action?
That is why I like storing both:
- an immutable ordered event log
- a provenance graph keyed by shared IDs
The graph is not a replacement for the log. It is an indexed view over causal relationships.
Event types
At minimum, emit these event types:
- agent.request.received
- tool.proposal.created
- tool.request.canonicalized
- policy.decision.issued
- approval.issued
- execution.started
- execution.completed
- execution.failed
- assistant.response.emitted
Causal edges
Then store edges such as:
- request -> proposal
- proposal -> canonical_request
- canonical_request -> policy_decision
- policy_decision -> approval
- approval -> execution
- execution -> tool_result
- tool_result -> assistant_response
Once you have those edges, forensic navigation becomes much easier than scrolling raw JSON.
Tamper-evident logging with a hash chain
You do not need a blockchain to get meaningful tamper evidence. A simple per-stream hash chain is often enough.
For each event:
event_hash = SHA256(prev_hash || canonical_event_bytes)
Where:
- prev_hash is the previous event hash in the stream
- canonical_event_bytes is the canonical JSON serialization of the event
Persist:
- the event payload
- prev_hash
- event_hash
- stream identifier
- sequence number
Then periodically anchor the latest stream hash into an external system:
- write-once object storage
- a separate control repository
- a signed daily digest
A stripped-down schema in Postgres might look like this:
CREATE TABLE control_plane_events (
  stream_id  TEXT NOT NULL,
  seq        BIGINT NOT NULL,
  event_type TEXT NOT NULL,
  event_time TIMESTAMPTZ NOT NULL,
  event_json JSONB NOT NULL,
  prev_hash  TEXT NOT NULL,
  event_hash TEXT NOT NULL,
  PRIMARY KEY (stream_id, seq)
);

CREATE TABLE provenance_edges (
  src_id     TEXT NOT NULL,
  dst_id     TEXT NOT NULL,
  edge_type  TEXT NOT NULL,
  event_time TIMESTAMPTZ NOT NULL
);
The point is not theoretical perfection. The point is to make silent deletion or reordering detectable.
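A verification pass over such a stream is short. This is a sketch; the all-zeros genesis value and the (payload, stored_hash) tuple shape are assumptions.

```python
import hashlib

def chain_hash(prev_hash, canonical_event_bytes):
    # event_hash = SHA256(prev_hash || canonical_event_bytes)
    return hashlib.sha256(prev_hash.encode("ascii") + canonical_event_bytes).hexdigest()

def verify_stream(events, genesis_hash="0" * 64):
    """events: ordered list of (canonical_event_bytes, stored_event_hash)."""
    prev = genesis_hash
    for seq, (payload, stored) in enumerate(events):
        expected = chain_hash(prev, payload)
        if expected != stored:
            return False, seq  # tampering or reordering detected at this index
        prev = expected
    return True, None
```

Run this against the anchored head hash and any silent deletion, edit, or reorder shows up as a mismatch at a specific sequence number.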
Replay is what separates controls from logging
The strongest reason to build this architecture is replay.
When something goes wrong, you want to ask at least three questions:
- Did the system behave according to the policy that was active at the time?
- Would current policy allow the same request?
- Did executor behavior match the authorized request exactly?
Those are different investigations.
Historical replay
Historical replay uses the original:
- canonical request
- capability descriptor version
- entitlement snapshot
- policy bundle version
- approval artifact
Then reruns the policy engine in offline mode and compares:
- original decision
- replayed decision
- execution trace
If historical replay disagrees with the stored decision, you likely have one of four problems:
- a non-deterministic policy dependency
- a serializer drift bug
- an incomplete entitlement snapshot
- log tampering or missing records
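A replay harness can stay small if every input was pinned at decision time. This sketch assumes historical policy engines are addressable by bundle hash; `replay_decision` and the record shape are illustrative, not a specific system's API.

```python
def replay_decision(record, policy_engines):
    """Re-evaluate a stored decision with its pinned historical inputs.

    record: dict holding the pinned canonical request, capability descriptor,
    entitlement snapshot, policy bundle hash, and the original decision.
    policy_engines: mapping from policy_bundle_sha256 to a pure evaluator.
    """
    engine = policy_engines[record["policy_bundle_sha256"]]  # historical bundle
    replayed = engine(
        record["canonical_request"],
        record["capability_descriptor"],
        record["entitlement_snapshot"],
    )
    return {
        "match": replayed == record["original_decision"],
        "original": record["original_decision"],
        "replayed": replayed,
    }
```

Counterfactual replay is the same function with the current bundle's engine substituted in.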
Counterfactual replay
Counterfactual replay runs the old request against current policy. This is extremely useful for control validation:
- would the new policy have blocked last month’s risky requests?
- would a planned entitlement change break legitimate workflows?
- which denied actions are now allowed, and why?
That turns incident data into a regression corpus for the control plane itself.
Executor conformance replay
Replay should also verify that the downstream call shape matched the authorized request. For example, if policy approved:
{
  "repo": "payments-api",
  "branch": "release/2026.04.14",
  "required_reviewers": 2
}
but the executor actually sent:
{
  "repo": "payments-api",
  "branch": "main",
  "required_reviewers": 0
}
then your problem is not authorization logic. It is executor drift or transformation abuse.
That is why the executor should record a normalized downstream request envelope, not just a success/failure flag.
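A conformance check over those envelopes can be a straightforward field diff. This is a sketch; the allow-list for benign executor-added fields (auth headers, trace IDs) is an assumption.

```python
def conformance_diff(authorized, sent, ignore=frozenset()):
    """Compare the authorized request fields to the envelope actually sent.

    Returns a list of (field, expected, actual); empty means conformant.
    """
    diffs = []
    # Any authorized field that was changed or dropped downstream.
    for field, expected in authorized.items():
        if sent.get(field) != expected:
            diffs.append((field, expected, sent.get(field)))
    # Any field the executor added that was never authorized or allow-listed.
    for field in sent.keys() - authorized.keys() - ignore:
        diffs.append((field, None, sent[field]))
    return diffs
```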
Entitlement snapshots must be explicit
Many authorization systems quietly re-read live entitlements during execution or replay. That breaks determinism.
Instead, snapshot the relevant entitlement state at decision time:
- roles
- group membership
- tenant bindings
- resource scopes
- case assignments
- environment access
- authn strength
You do not need to copy your entire identity graph into every event. You do need an immutable reference to the evaluated subset.
A practical pattern is:
- build the minimal entitlement view for the request
- hash it
- store it as an immutable blob or row
- reference it from the decision record
That gives you stable replay inputs without exploding storage.
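That pattern is only a few lines in practice. A sketch, assuming the view is JSON-serializable; the `ent_` prefix and truncated hash are illustrative choices.

```python
import hashlib
import json

def snapshot_entitlements(view):
    """Hash the minimal entitlement view into an immutable snapshot ID."""
    blob = json.dumps(view, sort_keys=True, separators=(",", ":")).encode("utf-8")
    snapshot_id = "ent_" + hashlib.sha256(blob).hexdigest()[:16]
    # Persist blob once, keyed by snapshot_id; decisions store only the ID.
    return snapshot_id, blob
```

Identical views deduplicate to the same snapshot ID, which keeps storage bounded even at high decision volume.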
Multi-tenant and environment isolation
Agentic systems often fail at boundaries that normal backends already learned to respect.
I would enforce isolation at four layers:
- Request envelope: every request carries tenant_id and environment.
- Policy bundle: environment-specific rules are separate bundles, not conditional sprawl in one giant policy file.
- Executor identity: dev, staging, and prod use distinct service principals and credentials.
- Storage and queues: event streams are partitioned by tenant and environment.
The critical insight is that policy isolation is not enough if executor identity is shared. A perfectly evaluated prod deny decision does not help if the executor still holds a broadly privileged credential that can touch prod indirectly.
Decision caching without lying to yourself
Interactive systems need low latency, so caching matters. But authorization caches are easy to get wrong.
I only cache decisions when the key includes every input that could change the result:
- canonical request hash
- policy bundle hash
- capability descriptor hash
- entitlement snapshot ID
- approval artifact hash, if present
That usually means caching at the decision level, not at the user/role level.
A cache key can be as strict as:
sha256(
  decision_key ||
  policy_bundle_sha256 ||
  capability_sha256 ||
  entitlement_snapshot_id ||
  approval_artifact_sha256
)
Two rules keep this sane:
- Denies should usually have shorter TTLs than allows, because entitlement changes often resolve denies.
- Any change to entitlements, capability descriptors, or policy bundles should advance a version or snapshot ID so old cache entries become unreachable.
Do not try to be clever with partial cache reuse until you can prove your invalidation model.
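The composite key above translates directly. A sketch; the separator choice is safe here only because the inputs are fixed-length hashes or opaque IDs that cannot contain it.

```python
import hashlib

def cache_key(decision_key, policy_sha, capability_sha, snapshot_id, approval_sha=""):
    """Build a cache key from every input that could change the decision."""
    material = "|".join([decision_key, policy_sha, capability_sha,
                         snapshot_id, approval_sha])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Because every versioned input is in the key, rotating a policy bundle or entitlement snapshot makes old entries unreachable instead of stale.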
Observability that helps during incidents
If this system is working, you should be able to debug a disputed tool action in minutes.
That means instrumenting more than generic request latency.
I would always emit:
- decision latency by capability and environment
- approval wait time
- executor conformance mismatch rate
- deny rate by reason code
- cross-scope attempt rate
- replay mismatch count
- percentage of mutating actions with valid idempotency keys
A useful detection query is “show me capabilities with rising denied cross-scope attempts,” because that often catches prompt injection, misbound sessions, or tenant isolation bugs.
SELECT
  event_json->>'capability_id' AS capability_id,
  event_json->'decision'->>'reason_codes' AS reason_codes,
  COUNT(*) AS deny_count
FROM control_plane_events
WHERE event_type = 'policy.decision.issued'
  AND event_time >= NOW() - INTERVAL '24 hours'
  AND event_json->'decision'->>'decision' = 'deny'
  AND event_json::text ILIKE '%scope.resource_denied%'
GROUP BY 1, 2
ORDER BY deny_count DESC;
I also like exporting one sampled replay result per capability per day. That catches quiet determinism regressions before an incident does.
Testing the control plane like a security product
This architecture deserves the same discipline you would apply to a parser, a payment system, or a crypto boundary.
The test suite should include at least five classes:
1. Property tests for canonicalization
Assert that semantically identical requests produce identical canonical bytes even if:
- JSON keys are reordered
- whitespace varies
- equivalent CIDR notation is used
- case differs in case-insensitive identifiers
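A minimal check of the key-ordering property, with a simplified `canonical_bytes` restated so the block is self-contained (json.dumps with sort_keys orders keys at every nesting level); real tests would exercise the versioned canonicalizer through a property-testing framework.

```python
import json

def canonical_bytes(request_obj):
    # Simplified restatement of the canonicalizer for this test.
    return json.dumps(request_obj, sort_keys=True,
                      separators=(",", ":"), ensure_ascii=True).encode("utf-8")

def test_key_order_irrelevant():
    a = {"tool_args": {"ticket_id": "INC-000001", "body": "x"}, "tenant_id": "t"}
    b = {"tenant_id": "t", "tool_args": {"body": "x", "ticket_id": "INC-000001"}}
    assert canonical_bytes(a) == canonical_bytes(b)
```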
2. Policy corpus tests
Store allow, deny, and approval-required cases as versioned fixtures. Run them in CI on every policy change.
3. Approval binding tests
Verify that:
- expired approvals fail
- approvals cannot be replayed
- approvals for one decision key cannot authorize another
4. Executor conformance tests
Compare the normalized downstream call envelope against the authorized request for every high-risk capability.
5. Replay determinism tests
Take sampled production events, redact as needed, and run nightly replay jobs. Any mismatch is a release-blocking defect until explained.
This last class is the one most teams skip. It is also the one that tells you whether the whole architecture is real.
Failure modes worth engineering for explicitly
Even a strong design fails in recognizable ways. I would plan for these up front:
Stale entitlement snapshot: identity changed after decision but before execution. Use short execution windows and reject if snapshot age exceeds the capability’s freshness budget.
Policy bundle skew: one policy node evaluates with a newer bundle than another. Include the bundle hash in every decision and fail closed on mixed versions for mutating actions.
Executor transformation drift: authorized fields are translated incorrectly. Record normalized downstream envelopes and compare them in replay.
Idempotency collapse: retries turn one approved action into several. Require idempotency keys on all mutate/export capabilities and persist them durably.
Approval laundering: a broad approval gets reused for narrower-looking but riskier actions. Bind approval to the exact decision hash, scope, and nonce.
Cross-case contamination: the agent proposes actions using context from the wrong incident. Carry case_id through the envelope and include it in both policy and retrieval scope.
The design is only credible if these failure modes are first-class objects in the system, not postmortem vocabulary.
A practical rollout plan
You do not need to build the full system in one shot. A staged rollout works well.
Phase 1: centralize tool execution
Route every tool call through a single broker. Even before policy is sophisticated, get one execution choke point and one event stream.
Phase 2: define capability descriptors
Replace ad hoc helper registration with explicit capability metadata. This usually forces good cleanup in the tool catalog.
Phase 3: introduce deterministic policy
Canonicalize requests, snapshot entitlements, and emit decision reason codes. Make “deny by default” the baseline.
Phase 4: add approval artifacts for mutating actions
Move from conversational approval to signed, bounded approval tokens.
Phase 5: add replay and conformance checking
At this point you can start using real traffic as a regression corpus and answer incident-review questions with evidence instead of inference.
That sequence keeps the work grounded. Teams often try to start with an elaborate policy language before they have stable request shapes. That usually ends in policy churn because the input contract is still moving.
What this buys you operationally
A replayable authorization control plane changes the quality of your answers during an incident.
Instead of saying:
- “the agent probably had access”
- “we think the user approved it”
- “the runtime should have blocked that”
you can say:
- this canonical request hash was evaluated
- this policy bundle and entitlement snapshot produced require_approval
- this executor used this service identity
- this downstream envelope matched or diverged from the authorized request
- current policy would or would not allow the same action today
That is the difference between observability and accountability.
Takeaways
If agentic systems are going to touch real infrastructure, they need more than prompt hardening and JSON schema checks. They need a real authorization control plane with deterministic inputs, explicit capability contracts, signed approvals, executor conformance, and replayable audit evidence.
The model can stay powerful without becoming privileged. That is the right split. Let the model propose. Let the control plane decide. Let the executor do exactly one approved thing, and leave behind enough evidence that a skeptical engineer can verify every step later.