C10 Layer 3 - Observability & Forensics

Deterministic Replay

Snapshot the inputs at the tool/memory boundary so that a replay produces matching output hashes. Incidents become reproducible and root causes become provable.

Why

If you can’t replay, you can’t debug reliably, prove causality, or validate mitigations.

Non-determinism that is not captured becomes operational chaos: the same incident cannot be reproduced twice.

What

A replay trace that captures all non-determinism:

model ID/version
prompt bundle hash and tool schema hash
temperature/seed
retrieved context hashes
tool request/response snapshots
orchestration routing decisions

Determinism scope: Deterministic replay in GATE is defined at the governed execution boundary (tools and memory). Replay reproduces the run by reusing recorded snapshots and pinned bundles, ensuring the same request_hash/response_hash pairs and equivalent side-effect outcomes. This does not require identical token-by-token model output across providers unless model versions and execution conditions are fully pinned.

Retrieved-context hashes recorded in the replay trace confirm what was retrieved at runtime, not whether the retrieved content was accurate or current. Replay reproduces the agent’s behaviour given the same inputs; it does not validate the inputs. See C18 (Data Quality Gates) for the retrieval-time quality boundary.
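As an illustration of the trace contents listed above, the following is a minimal sketch of a trace record in Python. The class names (ReplayTrace, ToolCallSnapshot), the field names, and the choice of SHA-256 over canonical JSON are assumptions made for this example; GATE does not prescribe them here.

```python
import hashlib
import json
from dataclasses import dataclass


def sha256_hex(payload) -> str:
    """Hash a JSON-serialisable payload using a canonical (sorted-key) encoding."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ToolCallSnapshot:
    """One recorded tool request/response pair (hypothetical shape)."""
    tool_name: str
    request: dict
    response: dict

    @property
    def request_hash(self) -> str:
        return sha256_hex({"tool": self.tool_name, "request": self.request})

    @property
    def response_hash(self) -> str:
        return sha256_hex(self.response)


@dataclass(frozen=True)
class ReplayTrace:
    """Everything a replay needs in order to avoid live dependencies (hypothetical shape)."""
    model_id: str
    model_version: str
    prompt_bundle_hash: str
    tool_schema_hash: str
    temperature: float
    seed: int
    retrieved_context_hashes: tuple  # hashes only: they show what was retrieved, not whether it was correct
    tool_calls: tuple                # ToolCallSnapshot instances, in call order
    routing_decisions: tuple         # orchestration routing decisions, in order
```

Canonical encoding matters in this sketch: two requests that differ only in key order hash to the same request_hash, so a replay does not diverge on serialisation noise.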

How

write trace events as an append-only stream
snapshot external tool responses (or store pointers to immutable snapshots)
build a replay harness that stubs tool calls with recorded responses
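
The three steps above can be sketched as follows. This is a minimal in-memory illustration, not a reference implementation: TraceRecorder, RecordingTools, ReplayTools, ReplayDivergence, and run_agent are hypothetical names, and a production recorder would write to durable, write-once storage instead of a Python list.

```python
import hashlib
import json


def sha256_hex(payload) -> str:
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TraceRecorder:
    """Append-only event stream: events can be added and read, never edited."""

    def __init__(self):
        self._lines = []

    def append(self, event: dict) -> None:
        self._lines.append(json.dumps(event, sort_keys=True))

    def events(self) -> list:
        return [json.loads(line) for line in self._lines]


class RecordingTools:
    """Wraps live tools and snapshots every request/response pair."""

    def __init__(self, live_tools: dict, recorder: TraceRecorder):
        self._live = live_tools
        self._recorder = recorder

    def call(self, name: str, request: dict) -> dict:
        response = self._live[name](request)
        self._recorder.append({
            "tool": name,
            "request_hash": sha256_hex({"tool": name, "request": request}),
            "response": response,
            "response_hash": sha256_hex(response),
        })
        return response


class ReplayDivergence(Exception):
    """Raised when the replayed run asks for something that was not recorded."""


class ReplayTools:
    """Stubs tool calls with recorded responses; never touches a live tool."""

    def __init__(self, events: list):
        self._events = list(events)
        self._cursor = 0

    def call(self, name: str, request: dict) -> dict:
        if self._cursor >= len(self._events):
            raise ReplayDivergence("no recorded response left for " + name)
        event = self._events[self._cursor]
        if event["request_hash"] != sha256_hex({"tool": name, "request": request}):
            raise ReplayDivergence("request differs from the recorded call to " + name)
        self._cursor += 1
        return event["response"]


def run_agent(tools) -> str:
    """Stand-in for an agent run: one tool call, output reduced to a hash."""
    weather = tools.call("get_weather", {"city": "Oslo"})
    return sha256_hex(weather)
```

In this sketch the same run_agent function is executed once against RecordingTools and later against ReplayTools built from recorder.events(); matching output hashes indicate a successful replay, while a ReplayDivergence indicates a missing or mismatched snapshot.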

Architect’s Note - Replay cold start (expired identity/policy)

GATE replay is defined as “no live dependencies,” which includes control-plane dependencies that may change over time. A replay executed months later MUST NOT fail due to expired tokens, rotated keys, or updated policy bundles. The replay harness therefore MUST provide local mocks (or recorded fixtures) for:

Identity Provider / Attestation verification: return the recorded workload identity claims and attestation status for the run being replayed (verification must succeed against recorded evidence, not current tokens).

Policy Engine evaluation: replay MUST use the recorded policy_bundle_hash and decision fixtures (or a policy engine loaded with the archived bundle) so decisions reproduce independently of current policy state.
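
A minimal sketch of the two fixtures described above, assuming the recorded identity claims and policy decisions have already been loaded from the replay trace. FixtureIdentityProvider, FixturePolicyEngine, RecordedIdentity, and their method signatures are hypothetical names for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordedIdentity:
    """Workload identity evidence captured at run time (hypothetical shape)."""
    workload_id: str
    claims: dict
    attestation_ok: bool


class FixtureIdentityProvider:
    """Returns recorded claims; never contacts a live IdP or refreshes a token."""

    def __init__(self, recorded: RecordedIdentity):
        self._recorded = recorded

    def verify(self, workload_id: str) -> RecordedIdentity:
        if workload_id != self._recorded.workload_id:
            raise LookupError("no recorded identity for " + workload_id)
        return self._recorded


class FixturePolicyEngine:
    """Returns recorded decisions, pinned to the recorded policy_bundle_hash."""

    def __init__(self, policy_bundle_hash: str, decisions: dict):
        self.policy_bundle_hash = policy_bundle_hash
        self._decisions = dict(decisions)  # request_hash -> recorded decision

    def evaluate(self, request_hash: str, policy_bundle_hash: str) -> str:
        if policy_bundle_hash != self.policy_bundle_hash:
            raise ValueError("replay must use the recorded policy bundle")
        try:
            return self._decisions[request_hash]
        except KeyError:
            raise LookupError("no recorded decision for " + request_hash) from None
```

Because neither fixture reads current tokens, keys, or policy state, a replay built on them behaves the same months after the original run. Failing loudly on an unknown workload or request hash keeps a gap in the recording visible instead of silently falling back to a live dependency.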

Normative requirement: Replay execution MUST validate authenticity by verifying recorded signatures and hashes (decision records, ledger events, request/response hashes) rather than requiring live token refresh or current IAM state.
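
The requirement above can be illustrated with a short verification sketch. It assumes decision records were signed at run time and that the verification key was archived with the trace. HMAC-SHA256 is used here only as a stand-in that runs with the Python standard library; a real deployment would more likely use asymmetric signatures, and the record layout and function names are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_bytes(payload) -> bytes:
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_record(record: dict, archived_key: bytes) -> str:
    """Stand-in signature: HMAC-SHA256 over the canonical record."""
    return hmac.new(archived_key, canonical_bytes(record), hashlib.sha256).hexdigest()


def verify_decision_record(entry: dict, archived_key: bytes, recorded_bundle_hash: str) -> bool:
    """Check a recorded decision against archived evidence only (no live IAM call)."""
    record = entry["record"]
    expected = sign_record(record, archived_key)
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(expected, entry["signature"]):
        return False
    return record["policy_bundle_hash"] == recorded_bundle_hash


def verify_response_hash(response: dict, recorded_response_hash: str) -> bool:
    """Confirm a stored tool response still matches the hash recorded in the trace."""
    actual = hashlib.sha256(canonical_bytes(response)).hexdigest()
    return hmac.compare_digest(actual, recorded_response_hash)
```

Both checks depend only on material stored with the trace, so they succeed or fail identically regardless of when the replay is executed.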

Evidence

replay success rate
mean time to reproduce an incident
regression tests built from incident traces
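
The first two evidence items can be computed from a log of replay attempts. The sketch below assumes a hypothetical ReplayAttempt record with a success flag and a time-to-reproduce in minutes; the field names and the choice to average only successful attempts are assumptions, not part of the control.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ReplayAttempt:
    incident_id: str
    succeeded: bool               # True when replay output hashes matched the recording
    minutes_to_reproduce: float   # time from incident report to reproduced run


def replay_success_rate(attempts) -> float:
    """Fraction of replay attempts whose hashes matched; 0.0 when there are none."""
    attempts = list(attempts)
    if not attempts:
        return 0.0
    return sum(1 for a in attempts if a.succeeded) / len(attempts)


def mean_time_to_reproduce(attempts):
    """Mean minutes to reproduce, over successful attempts only; None if there are none."""
    durations = [a.minutes_to_reproduce for a in attempts if a.succeeded]
    return mean(durations) if durations else None
```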

Failure modes

missing tool outputs (replay diverges)
replay uses live external dependencies
model/prompt versions not pinned

NIST AI RMF alignment

C10 maps to MEASURE and MANAGE. See the framework paper for the specific subcontrol mappings.

ISO/IEC alignment

C10 maps to ISO/IEC 27001. Typical evidence: see the Evidence section above.