Designing a Replayable Authorization Control Plane for Agentic Systems
Most teams start agent tooling with the wrong trust boundary. The model plans, the framework emits a tool call, and some thin executor turns that into a real API request. It works in demos. It also collapses authorization, provenance, approval, and audit into a single opaque hop.
That design fails the moment the system matters. A hiring manager evaluating security and platform engineering judgment should care less about whether an agent can call tools, and more about whether the system can answer four hard questions after an incident: who asked for the action, what policy allowed it, what exact input was evaluated, and can we replay the decision deterministically after the fact.
This post describes a control-plane design for agentic systems that treats tool execution like a security-critical distributed systems problem. The goal is not to make the model trustworthy. The goal is to make the model unable to bypass deterministic controls.
Context
Problem: Direct model-to-tool execution creates an authorization blind spot and leaves weak forensic evidence.
Approach: Introduce a dedicated authorization control plane that canonicalizes tool requests, evaluates deterministic policy, issues bounded approvals, and records tamper-evident provenance.
Outcome: Tool-enabled agents remain useful, but every state-changing action becomes explainable, reviewable, and replayable.
Why common agent tool stacks break under pressure
Most agent frameworks get the happy path right:
- User asks the assistant to do something.
- The model emits a tool call.
- The runtime executes it.
- The result comes back to the model.
That flow is fine for weather lookups. It is not fine for anything with production impact.
The deeper problem is that a tool call is not a UI event. It is a delegated operation with multiple identities in play:
- the human principal
- the application session
- the model run
- the tool runner service
- the downstream system identity
If those identities are not carried separately, they collapse into “the agent did it,” which is operationally useless.
The usual failure modes are predictable:
- Prompt injection causes the model to propose an action outside the user’s approved scope.
- Read-only and state-changing operations share the same execution path.
- Tool arguments are validated syntactically but not semantically.
- Approval is tracked in chat state instead of being cryptographically bound to a specific action.
- Auditing records the final API call but not the policy inputs that led to it.
- Incident review cannot distinguish “policy bug” from “stale entitlement” from “executor drift.”
If you cannot replay an authorization decision with the same inputs and get the same answer, you do not have a control plane. You have a best-effort runtime check.
Design goals
A useful control plane for agentic systems should satisfy these requirements:
- Every tool request is bound to a human principal, session, tenant, environment, and model run.
- Tool capabilities are declared explicitly instead of inferred from function names.
- Policy evaluation is deterministic over a canonical request shape.
- Mutable actions require a separate approval artifact that is bound to the evaluated request.
- Execution logs are append-only and tamper-evident.
- The system can replay historical decisions with historical policy and also simulate the same request against current policy.
- Cross-tenant, cross-environment, and cross-case access is impossible by construction, not by prompt wording.
- Latency remains low enough for interactive use.
Those goals push you toward a control-plane/data-plane split.
Reference architecture
The model remains the planner. The control plane owns authorization. The executor owns side effects. Audit spans all three.
        User Request
             |
             v
+----------------------+
|  Session / Identity  |
|        Broker        |
+----------+-----------+
           |
           v
+----------------------+
|    Planner Model     |
| (untrusted proposer) |
+----------+-----------+
           |
  proposed tool call
     + rationale
           |
           v
+------------------------+
|     Canonicalizer      |
| schema + normalization |
+----------+-------------+
           |
  canonical request
           |
           v
+----------------------+      +----------------------+
|    Policy Engine     |<---->| Entitlement Snapshot |
| capability + context |      | roles, attrs, scope  |
+----------+-----------+      +----------------------+
           |
  allow / deny /
  require_approval
           |
           v
+----------------------+      +----------------------+
|   Approval Service   |----->|   Signed Approval    |
| step-up, TTL, nonce  |      |       Artifact       |
+----------+-----------+      +----------------------+
           |
           v
+----------------------+
| Executor / Tool Gate |
| idempotency + egress |
+----------+-----------+
           |
           v
+----------------------+
|  Downstream System   |
+----------------------+
All transitions emit structured events into:
- tamper-evident audit log
- provenance graph
- metrics and traces
Two implementation details matter here:
- The executor never trusts a model-emitted tool call on its own.
- The policy engine never trusts executor-side interpretation of the request.
Both components operate on the same canonical request bytes or their hashes. That is what makes replay viable.
Model outputs are proposals, not permissions
Treat model output as an untrusted plan. The agent can suggest:
- which capability it wants
- which arguments it thinks are relevant
- why it thinks the action is justified
The agent cannot grant itself:
- resource scope
- user identity
- approval state
- environment selection
- mutating privileges
That distinction sounds obvious, but a surprising number of agent implementations still let the model choose things like account IDs, repository names, or “admin” modes directly from prompt context.
A safer design derives high-value context outside the model:
- tenant_id from the authenticated session
- user_id from the identity provider
- case_id from the active incident or workflow
- environment from the selected workspace
- allowed resources from entitlements or precomputed selectors
The model can reference them. It should not originate them.
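A sketch of that derivation, assuming a simple session object and envelope shape. The field names mirror the canonical request shown later; `build_envelope` and `TRUSTED_FIELDS` are illustrative names, not a specific framework's API.

```python
import copy

# Identity fields the model is never allowed to originate.
TRUSTED_FIELDS = {"tenant_id", "user_id", "case_id", "environment"}

def build_envelope(session, proposal):
    """Merge a model proposal with server-derived context.

    Model-supplied values never override server-derived identity fields.
    """
    envelope = {
        "tenant_id": session["tenant_id"],      # from the authenticated session
        "user_id": session["user_id"],          # from the identity provider
        "case_id": session["case_id"],          # from the active workflow
        "environment": session["environment"],  # from the selected workspace
        "capability_id": proposal["capability_id"],
        "tool_args": copy.deepcopy(proposal.get("tool_args", {})),
    }
    # Reject proposals that try to smuggle trusted fields in as tool args.
    smuggled = TRUSTED_FIELDS & envelope["tool_args"].keys()
    if smuggled:
        raise ValueError(f"model attempted to set trusted fields: {sorted(smuggled)}")
    return envelope
```

The model's rationale text can still travel alongside the envelope for audit, but nothing in it feeds the trusted fields.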
Capability descriptors instead of loose tool definitions
A function signature is not a security contract. A capability descriptor is.
Each exposed tool should publish a descriptor that tells the control plane what the tool is allowed to do and what conditions must hold before it is executed.
{
  "capability_id": "ticket.comment.create",
  "version": "2026-04-14",
  "effect": "mutate",
  "resource_kind": "ticket",
  "resource_selector": {
    "type": "scoped_ref",
    "source": "session.case.allowed_tickets"
  },
  "args_schema": {
    "type": "object",
    "required": ["ticket_id", "body"],
    "properties": {
      "ticket_id": { "type": "string", "pattern": "^INC-[0-9]{6}$" },
      "body": { "type": "string", "minLength": 1, "maxLength": 4000 }
    },
    "additionalProperties": false
  },
  "approval": {
    "required": true,
    "mode": "human_step_up",
    "ttl_seconds": 300
  },
  "idempotency": {
    "required": true,
    "key_fields": ["ticket_id", "body_sha256"]
  },
  "network_policy": {
    "egress_class": "internal_api_only"
  }
}
This is the point where many teams discover they never really defined their tools. They defined helpers.
A good capability descriptor answers:
- what resource family is being touched
- whether the effect is observe, propose, mutate, or export
- how resource scope is derived
- whether approval is required
- which arguments are security-sensitive
- what idempotency means for this operation
- what network and identity boundary the executor must use
Once you have this descriptor model, policy becomes dramatically simpler because policy no longer needs to understand every executor implementation detail.
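One way to make descriptors first-class is to validate them on load and content-address them by hash, so decisions can pin the exact descriptor version they evaluated. The `CapabilityRegistry` class and its field checks below are an illustrative sketch, not a specific framework's API.

```python
import hashlib
import json

# Minimal required shape for a descriptor; a real registry would validate
# against a full schema for the descriptor format itself.
REQUIRED_KEYS = {"capability_id", "version", "effect", "args_schema"}
ALLOWED_EFFECTS = {"observe", "propose", "mutate", "export"}

class CapabilityRegistry:
    def __init__(self):
        self._by_id = {}

    def register(self, descriptor):
        missing = REQUIRED_KEYS - descriptor.keys()
        if missing:
            raise ValueError(f"descriptor missing keys: {sorted(missing)}")
        if descriptor["effect"] not in ALLOWED_EFFECTS:
            raise ValueError(f"unknown effect: {descriptor['effect']}")
        # Content-address the descriptor so decisions can reference it by hash.
        blob = json.dumps(descriptor, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256(blob.encode("utf-8")).hexdigest()
        self._by_id[(descriptor["capability_id"], descriptor["version"])] = (descriptor, digest)
        return digest  # the value a decision record stores as capability_sha256

    def lookup(self, capability_id, version):
        return self._by_id[(capability_id, version)]
```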
Canonicalization is the foundation for deterministic policy
If identical requests can serialize differently, you cannot trust cache keys, signatures, or replay results.
Canonicalization has to happen before policy evaluation and before approval token issuance. I prefer a strict request envelope with:
- stable field ordering
- normalized timestamps
- explicit null handling
- exact numeric handling
- resource lists sorted lexicographically
- derived fields computed outside user or model control
A canonical request shape might look like this:
{
  "schema_version": 1,
  "tenant_id": "acme-prod",
  "environment": "prod",
  "principal": {
    "user_id": "u_12345",
    "roles": ["soc_tier2"],
    "authn_strength": "phishing_resistant_mfa"
  },
  "session": {
    "session_id": "sess_7f6d",
    "case_id": "IR-2026-1042",
    "workflow_id": "wf_triage_01"
  },
  "agent_run": {
    "run_id": "run_94a1",
    "model_policy_tier": "restricted",
    "prompt_template_sha256": "8fc0d0..."
  },
  "capability_id": "ticket.comment.create",
  "capability_version": "2026-04-14",
  "tool_args": {
    "ticket_id": "INC-104233",
    "body": "Observed suspicious OAuth token use from a new ASN."
  },
  "derived_scope": {
    "allowed_ticket_ids": ["INC-104233"]
  },
  "request_time": "2026-04-14T15:02:11Z"
}
The canonicalization function should be versioned exactly like policy bundles are versioned. Otherwise you will not know whether a replay mismatch came from a policy change or a serializer change.
A minimal implementation in Python can be surprisingly small:
import hashlib
import json
from decimal import Decimal


def normalize(value):
    # Recursively sort mapping keys and normalize numeric types.
    if isinstance(value, dict):
        return {k: normalize(value[k]) for k in sorted(value)}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    if isinstance(value, Decimal):
        # Fixed-point string form avoids float representation drift.
        return format(value, "f")
    return value


def canonical_bytes(request_obj):
    normalized = normalize(request_obj)
    return json.dumps(
        normalized,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=True,
    ).encode("utf-8")


def decision_key(request_obj):
    return hashlib.sha256(canonical_bytes(request_obj)).hexdigest()
This does not solve semantic normalization on its own. You still need capability-specific handling for things like:
- CIDR normalization
- case-insensitive identifiers
- Unicode normalization
- whitespace collapse in free text fields
- stable redaction of secrets before logging
That logic belongs in the canonicalizer, not scattered across executors.
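A few of those normalizers can be sketched with the standard library alone. Which fields get which treatment would be declared per capability; the helper names here are illustrative.

```python
import ipaddress
import re
import unicodedata

def normalize_cidr(value):
    # "10.0.0.1/24" and "10.0.0.0/24" should compare equal as networks.
    return str(ipaddress.ip_network(value, strict=False))

def normalize_identifier(value):
    # Case-insensitive identifiers compare in one canonical case.
    return value.strip().lower()

def normalize_free_text(value):
    # Unicode NFC, then whitespace collapse, for free-text fields.
    return re.sub(r"\s+", " ", unicodedata.normalize("NFC", value)).strip()
```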
Separate syntax validation from semantic authorization
I like to evaluate tool requests in three stages:
1. Shape: does the request match the declared schema?
2. Scope: does the principal have access to the referenced resource set?
3. Effect: is this principal allowed to perform this class of action in this context?
Those are different questions and they fail for different reasons.
For example, this is a syntactically valid request:
1
2
3
4
5
6
7
{
"capability_id": "repo.branch.delete",
"tool_args": {
"repo": "payments-api",
"branch": "main"
}
}
But the semantic policy should still deny it if:
- the repo is outside the approved workspace
- the branch is protected
- the user is not in a release manager role
- the session does not have a linked change ticket
- the environment is production and no approval token is attached
When teams collapse those into one if statement, they lose the ability to explain why a request failed and to detect systematic drift across categories.
A deterministic policy evaluation model
Policy engines differ, but the input/output contract should be simple:
- input: canonical request, capability descriptor, entitlement snapshot, policy bundle version
- output: allow, deny, or require_approval, plus machine-readable reasons
The reasons need stable codes. Free-form strings are not enough.
{
  "decision": "require_approval",
  "reason_codes": [
    "effect.mutate",
    "env.prod",
    "approval.missing"
  ],
  "policy_bundle_sha256": "ab19d4...",
  "entitlement_snapshot_id": "ent_2026_04_14_1502_88",
  "capability_sha256": "551ec1..."
}
A clean internal evaluator can be expressed in a few steps:
def evaluate(request, capability, entitlements):
    if not schema_valid(request.tool_args, capability.args_schema):
        return deny("schema.invalid")
    if not resource_scope_allows(request, capability, entitlements):
        return deny("scope.resource_denied")
    if request.environment == "prod" and capability.effect in {"mutate", "export"}:
        if not request.approval_artifact:
            return require_approval("approval.missing")
    if capability.effect == "mutate" and "incident_commander" not in entitlements.roles:
        return deny("role.insufficient")
    return allow()
That example is intentionally small. Real systems add:
- attribute-based rules
- time-window constraints
- case binding
- separation-of-duty checks
- environment-specific guardrails
- field-level restrictions
The important part is that the decision is pure with respect to the evaluated snapshot. No hidden database reads. No executor-side policy overrides. No “best effort” fallback.
Approval should be a signed artifact, not a chat message
One of the easiest mistakes in agent systems is letting approval exist as conversational state:
- “yes, go ahead”
- “approved by analyst”
- “confirmed in Slack”
That is not approval. That is ambiguous text.
Approval for a state-changing action should be represented as a signed artifact bound to:
- the decision key
- capability ID and version
- approver identity
- expiration time
- single-use nonce
- optional quantity or scope limits
For example:
{
  "approval_id": "apr_8e91",
  "decision_key": "8fd57e4d8f7b...",
  "approved_by": "u_9001",
  "approved_role": "incident_commander",
  "issued_at": "2026-04-14T15:03:02Z",
  "expires_at": "2026-04-14T15:08:02Z",
  "nonce": "2f65e46d-f2f1-4db5-9c17-8aa4c5f77a5c",
  "allowed_uses": 1,
  "signature": "base64url(...)"
}
The executor verifies the signature and also verifies that the decision_key equals the hash of the canonical request it received. That closes an important gap: a model cannot obtain approval for one action and reuse it for a slightly different one.
This is also where short TTLs matter. Approval artifacts should be cheap to issue and cheap to expire. Long-lived approvals become shadow privileges.
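A minimal verification sketch, using HMAC where a production system would likely use an asymmetric signature, and epoch-second expiry for brevity. `sign_approval`, `verify_approval`, and the in-memory nonce set are illustrative stand-ins for real key management and durable nonce storage.

```python
import hashlib
import hmac
import json
import time

def sign_approval(artifact, key):
    # Sign everything except the signature field itself, canonically serialized.
    body = json.dumps({k: v for k, v in artifact.items() if k != "signature"},
                      sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, body, hashlib.sha256).hexdigest()

def verify_approval(artifact, key, canonical_request_bytes, used_nonces, now=None):
    now = now if now is not None else time.time()
    if not hmac.compare_digest(artifact["signature"], sign_approval(artifact, key)):
        return False, "approval.bad_signature"
    if now >= artifact["expires_at"]:  # epoch seconds here for brevity
        return False, "approval.expired"
    if artifact["nonce"] in used_nonces:
        return False, "approval.replayed"
    # Bind the approval to the exact canonical request that was evaluated.
    if artifact["decision_key"] != hashlib.sha256(canonical_request_bytes).hexdigest():
        return False, "approval.decision_mismatch"
    used_nonces.add(artifact["nonce"])
    return True, None
```

The decision-key check is what prevents approval reuse across slightly different requests.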
Executors should be boring and constrained
A good executor is intentionally dull. It does not improvise.
Executor responsibilities:
- verify decision and approval artifacts
- enforce idempotency
- map the capability to a concrete downstream API call
- execute with the least-privileged service identity for that capability
- emit structured execution records
- return normalized results
Executor non-responsibilities:
- reevaluating policy from scratch
- inferring missing arguments
- broadening resource scope
- retrying forever on destructive operations
- changing behavior based on model rationale text
Think of the executor as a data-plane worker with a narrow contract. All policy richness should already be settled before execution starts.
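Idempotency enforcement can be sketched by deriving the key from the descriptor's key_fields. This assumes a naming convention where a field like body_sha256 means "hash of the body argument"; both that convention and the helper name are assumptions for illustration.

```python
import hashlib

def idempotency_key(capability_id, key_fields, tool_args):
    """Derive a stable idempotency key from the declared key fields."""
    parts = [capability_id]
    for field in key_fields:
        if field.endswith("_sha256"):
            # Assumed convention: body_sha256 hashes the "body" argument.
            source = field[: -len("_sha256")]
            parts.append(hashlib.sha256(tool_args[source].encode("utf-8")).hexdigest())
        else:
            parts.append(tool_args[field])
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
```

The executor persists this key before calling downstream, so retries of the same approved action collapse into one execution.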
Provenance needs both an event log and a graph
Most audit systems stop at append-only events:
- request received
- decision issued
- tool executed
- result returned
That is necessary but insufficient. During investigation you often need graph questions:
- which assistant response cited this tool result?
- which downstream mutation came from this user approval?
- which retrieved documents influenced the model run that proposed this action?
That is why I like storing both:
- an immutable ordered event log
- a provenance graph keyed by shared IDs
The graph is not a replacement for the log. It is an indexed view over causal relationships.
Event types
At minimum, emit these event types:
- agent.request.received
- tool.proposal.created
- tool.request.canonicalized
- policy.decision.issued
- approval.issued
- execution.started
- execution.completed
- execution.failed
- assistant.response.emitted
Causal edges
Then store edges such as:
- request -> proposal
- proposal -> canonical_request
- canonical_request -> policy_decision
- policy_decision -> approval
- approval -> execution
- execution -> tool_result
- tool_result -> assistant_response
Once you have those edges, forensic navigation becomes much easier than scrolling raw JSON.
Tamper-evident logging with a hash chain
You do not need a blockchain to get meaningful tamper evidence. A simple per-stream hash chain is often enough.
For each event:
event_hash = SHA256(prev_hash || canonical_event_bytes)
Where:
- prev_hash is the previous event hash in the stream
- canonical_event_bytes is the canonical JSON serialization of the event
Persist:
- the event payload
- prev_hash
- event_hash
- stream identifier
- sequence number
Then periodically anchor the latest stream hash into an external system:
- write-once object storage
- a separate control repository
- a signed daily digest
A stripped-down schema in Postgres might look like this:
CREATE TABLE control_plane_events (
  stream_id  TEXT NOT NULL,
  seq        BIGINT NOT NULL,
  event_type TEXT NOT NULL,
  event_time TIMESTAMPTZ NOT NULL,
  event_json JSONB NOT NULL,
  prev_hash  TEXT NOT NULL,
  event_hash TEXT NOT NULL,
  PRIMARY KEY (stream_id, seq)
);

CREATE TABLE provenance_edges (
  src_id     TEXT NOT NULL,
  dst_id     TEXT NOT NULL,
  edge_type  TEXT NOT NULL,
  event_time TIMESTAMPTZ NOT NULL
);
The point is not theoretical perfection. The point is to make silent deletion or reordering detectable.
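A verification pass over such a stream is short. This is a sketch; the all-zeros genesis value and the (payload, stored_hash) tuple shape are assumptions.

```python
import hashlib

def chain_hash(prev_hash, canonical_event_bytes):
    # event_hash = SHA256(prev_hash || canonical_event_bytes)
    return hashlib.sha256(prev_hash.encode("ascii") + canonical_event_bytes).hexdigest()

def verify_stream(events, genesis_hash="0" * 64):
    """events: ordered list of (canonical_event_bytes, stored_event_hash)."""
    prev = genesis_hash
    for seq, (payload, stored) in enumerate(events):
        expected = chain_hash(prev, payload)
        if expected != stored:
            return False, seq  # tampering or reordering detected at this index
        prev = expected
    return True, None
```

Run this against the anchored head hash and any silent deletion, edit, or reorder shows up as a mismatch at a specific sequence number.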
Replay is what separates controls from logging
The strongest reason to build this architecture is replay.
When something goes wrong, you want to ask at least three questions:
- Did the system behave according to the policy that was active at the time?
- Would current policy allow the same request?
- Did executor behavior match the authorized request exactly?
Those are different investigations.
Historical replay
Historical replay uses the original:
- canonical request
- capability descriptor version
- entitlement snapshot
- policy bundle version
- approval artifact
Then reruns the policy engine in offline mode and compares:
- original decision
- replayed decision
- execution trace
If historical replay disagrees with the stored decision, you likely have one of four problems:
- a non-deterministic policy dependency
- a serializer drift bug
- an incomplete entitlement snapshot
- log tampering or missing records
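A replay harness can stay small if every input was pinned at decision time. This sketch assumes historical policy engines are addressable by bundle hash; `replay_decision` and the record shape are illustrative, not a specific system's API.

```python
def replay_decision(record, policy_engines):
    """Re-evaluate a stored decision with its pinned historical inputs.

    record: dict holding the pinned canonical request, capability descriptor,
    entitlement snapshot, policy bundle hash, and the original decision.
    policy_engines: mapping from policy_bundle_sha256 to a pure evaluator.
    """
    engine = policy_engines[record["policy_bundle_sha256"]]  # historical bundle
    replayed = engine(
        record["canonical_request"],
        record["capability_descriptor"],
        record["entitlement_snapshot"],
    )
    return {
        "match": replayed == record["original_decision"],
        "original": record["original_decision"],
        "replayed": replayed,
    }
```

Counterfactual replay is the same function with the current bundle's engine substituted in.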
Counterfactual replay
Counterfactual replay runs the old request against current policy. This is extremely useful for control validation:
- would the new policy have blocked last month’s risky requests?
- would a planned entitlement change break legitimate workflows?
- which denied actions are now allowed, and why?
That turns incident data into a regression corpus for the control plane itself.
Executor conformance replay
Replay should also verify that the downstream call shape matched the authorized request. For example, if policy approved:
{
  "repo": "payments-api",
  "branch": "release/2026.04.14",
  "required_reviewers": 2
}
but the executor actually sent:
{
  "repo": "payments-api",
  "branch": "main",
  "required_reviewers": 0
}
then your problem is not authorization logic. It is executor drift or transformation abuse.
That is why the executor should record a normalized downstream request envelope, not just a success/failure flag.
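A conformance check over those envelopes can be a straightforward field diff. This is a sketch; the allow-list for benign executor-added fields (auth headers, trace IDs) is an assumption.

```python
def conformance_diff(authorized, sent, ignore=frozenset()):
    """Compare the authorized request fields to the envelope actually sent.

    Returns a list of (field, expected, actual); empty means conformant.
    """
    diffs = []
    # Any authorized field that was changed or dropped downstream.
    for field, expected in authorized.items():
        if sent.get(field) != expected:
            diffs.append((field, expected, sent.get(field)))
    # Any field the executor added that was never authorized or allow-listed.
    for field in sent.keys() - authorized.keys() - ignore:
        diffs.append((field, None, sent[field]))
    return diffs
```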
Entitlement snapshots must be explicit
Many authorization systems quietly re-read live entitlements during execution or replay. That breaks determinism.
Instead, snapshot the relevant entitlement state at decision time:
- roles
- group membership
- tenant bindings
- resource scopes
- case assignments
- environment access
- authn strength
You do not need to copy your entire identity graph into every event. You do need an immutable reference to the evaluated subset.
A practical pattern is:
- build the minimal entitlement view for the request
- hash it
- store it as an immutable blob or row
- reference it from the decision record
That gives you stable replay inputs without exploding storage.
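That pattern is only a few lines in practice. A sketch, assuming the view is JSON-serializable; the `ent_` prefix and truncated hash are illustrative choices.

```python
import hashlib
import json

def snapshot_entitlements(view):
    """Hash the minimal entitlement view into an immutable snapshot ID."""
    blob = json.dumps(view, sort_keys=True, separators=(",", ":")).encode("utf-8")
    snapshot_id = "ent_" + hashlib.sha256(blob).hexdigest()[:16]
    # Persist blob once, keyed by snapshot_id; decisions store only the ID.
    return snapshot_id, blob
```

Identical views deduplicate to the same snapshot ID, which keeps storage bounded even at high decision volume.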
Multi-tenant and environment isolation
Agentic systems often fail at boundaries that normal backends already learned to respect.
I would enforce isolation at four layers:
- Request envelope: every request carries tenant_id and environment.
- Policy bundle: environment-specific rules are separate bundles, not conditional sprawl in one giant policy file.
- Executor identity: dev, staging, and prod use distinct service principals and credentials.
- Storage and queues: event streams are partitioned by tenant and environment.
The critical insight is that policy isolation is not enough if executor identity is shared. A perfectly evaluated prod deny decision does not help if the executor still holds a broadly privileged credential that can touch prod indirectly.
Decision caching without lying to yourself
Interactive systems need low latency, so caching matters. But authorization caches are easy to get wrong.
I only cache decisions when the key includes every input that could change the result:
- canonical request hash
- policy bundle hash
- capability descriptor hash
- entitlement snapshot ID
- approval artifact hash, if present
That usually means caching at the decision level, not at the user/role level.
A cache key can be as strict as:
sha256(
  decision_key ||
  policy_bundle_sha256 ||
  capability_sha256 ||
  entitlement_snapshot_id ||
  approval_artifact_sha256
)
Two rules keep this sane:
- Denies should usually have shorter TTLs than allows, because entitlement changes often resolve denies.
- Any change to entitlements, capability descriptors, or policy bundles should advance a version or snapshot ID so old cache entries become unreachable.
Do not try to be clever with partial cache reuse until you can prove your invalidation model.
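The composite key above translates directly. A sketch; the separator choice is safe here only because the inputs are fixed-length hashes or opaque IDs that cannot contain it.

```python
import hashlib

def cache_key(decision_key, policy_sha, capability_sha, snapshot_id, approval_sha=""):
    """Build a cache key from every input that could change the decision."""
    material = "|".join([decision_key, policy_sha, capability_sha,
                         snapshot_id, approval_sha])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Because every versioned input is in the key, rotating a policy bundle or entitlement snapshot makes old entries unreachable instead of stale.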
Observability that helps during incidents
If this system is working, you should be able to debug a disputed tool action in minutes.
That means instrumenting more than generic request latency.
I would always emit:
- decision latency by capability and environment
- approval wait time
- executor conformance mismatch rate
- deny rate by reason code
- cross-scope attempt rate
- replay mismatch count
- percentage of mutating actions with valid idempotency keys
A useful detection query is “show me capabilities with rising denied cross-scope attempts,” because that often catches prompt injection, misbound sessions, or tenant isolation bugs.
SELECT
  event_json->>'capability_id' AS capability_id,
  event_json->'decision'->>'reason_codes' AS reason_codes,
  COUNT(*) AS deny_count
FROM control_plane_events
WHERE event_type = 'policy.decision.issued'
  AND event_time >= NOW() - INTERVAL '24 hours'
  AND event_json->'decision'->>'decision' = 'deny'
  AND event_json::text ILIKE '%scope.resource_denied%'
GROUP BY 1, 2
ORDER BY deny_count DESC;
I also like exporting one sampled replay result per capability per day. That catches quiet determinism regressions before an incident does.
Testing the control plane like a security product
This architecture deserves the same discipline you would apply to a parser, a payment system, or a crypto boundary.
The test suite should include at least five classes:
1. Property tests for canonicalization
Assert that semantically identical requests produce identical canonical bytes even if:
- JSON keys are reordered
- whitespace varies
- equivalent CIDR notation is used
- case differs in case-insensitive identifiers
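A minimal check of the key-ordering property, with a simplified `canonical_bytes` restated so the block is self-contained (json.dumps with sort_keys orders keys at every nesting level); real tests would exercise the versioned canonicalizer through a property-testing framework.

```python
import json

def canonical_bytes(request_obj):
    # Simplified restatement of the canonicalizer for this test.
    return json.dumps(request_obj, sort_keys=True,
                      separators=(",", ":"), ensure_ascii=True).encode("utf-8")

def test_key_order_irrelevant():
    a = {"tool_args": {"ticket_id": "INC-000001", "body": "x"}, "tenant_id": "t"}
    b = {"tenant_id": "t", "tool_args": {"body": "x", "ticket_id": "INC-000001"}}
    assert canonical_bytes(a) == canonical_bytes(b)
```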
2. Policy corpus tests
Store allow, deny, and approval-required cases as versioned fixtures. Run them in CI on every policy change.
3. Approval binding tests
Verify that:
- expired approvals fail
- approvals cannot be replayed
- approvals for one decision key cannot authorize another
4. Executor conformance tests
Compare the normalized downstream call envelope against the authorized request for every high-risk capability.
5. Replay determinism tests
Take sampled production events, redact as needed, and run nightly replay jobs. Any mismatch is a release-blocking defect until explained.
This last class is the one most teams skip. It is also the one that tells you whether the whole architecture is real.
Failure modes worth engineering for explicitly
Even a strong design fails in recognizable ways. I would plan for these up front:
Stale entitlement snapshot: identity changed after decision but before execution. Use short execution windows and reject if snapshot age exceeds the capability’s freshness budget.
Policy bundle skew: one policy node evaluates with a newer bundle than another. Include the bundle hash in every decision and fail closed on mixed versions for mutating actions.
Executor transformation drift: authorized fields are translated incorrectly. Record normalized downstream envelopes and compare them in replay.
Idempotency collapse: retries turn one approved action into several. Require idempotency keys on all mutate/export capabilities and persist them durably.
Approval laundering: a broad approval gets reused for narrower-looking but riskier actions. Bind approval to the exact decision hash, scope, and nonce.
Cross-case contamination: the agent proposes actions using context from the wrong incident. Carry case_id through the envelope and include it in both policy and retrieval scope.
The design is only credible if these failure modes are first-class objects in the system, not postmortem vocabulary.
A practical rollout plan
You do not need to build the full system in one shot. A staged rollout works well.
Phase 1: centralize tool execution
Route every tool call through a single broker. Even before policy is sophisticated, get one execution choke point and one event stream.
Phase 2: define capability descriptors
Replace ad hoc helper registration with explicit capability metadata. This usually forces good cleanup in the tool catalog.
Phase 3: introduce deterministic policy
Canonicalize requests, snapshot entitlements, and emit decision reason codes. Make “deny by default” the baseline.
Phase 4: add approval artifacts for mutating actions
Move from conversational approval to signed, bounded approval tokens.
Phase 5: add replay and conformance checking
At this point you can start using real traffic as a regression corpus and answer incident-review questions with evidence instead of inference.
That sequence keeps the work grounded. Teams often try to start with an elaborate policy language before they have stable request shapes. That usually ends in policy churn because the input contract is still moving.
What this buys you operationally
A replayable authorization control plane changes the quality of your answers during an incident.
Instead of saying:
- “the agent probably had access”
- “we think the user approved it”
- “the runtime should have blocked that”
you can say:
- this canonical request hash was evaluated
- this policy bundle and entitlement snapshot produced require_approval
- this executor used this service identity
- this downstream envelope matched or diverged from the authorized request
- current policy would or would not allow the same action today
That is the difference between observability and accountability.
Takeaways
If agentic systems are going to touch real infrastructure, they need more than prompt hardening and JSON schema checks. They need a real authorization control plane with deterministic inputs, explicit capability contracts, signed approvals, executor conformance, and replayable audit evidence.
The model can stay powerful without becoming privileged. That is the right split. Let the model propose. Let the control plane decide. Let the executor do exactly one approved thing, and leave behind enough evidence that a skeptical engineer can verify every step later.