Deterministic Replay
Snapshot the inputs at the tool/memory boundary; replay produces matching output hashes. Incidents become reproducible and root cause becomes provable.
Why
- If you can’t replay, you can’t debug reliably, prove causality, or validate mitigations.
- Non-determinism becomes operational chaos.
What
A replay trace that captures all non-determinism:
- model ID/version
- prompt bundle hash and tool schema hash
- temperature/seed
- retrieved context hashes
- tool request/response snapshots
- orchestration routing decisions
Determinism scope: Deterministic replay in GATE is defined at the governed execution boundary (tools and memory). Replay reproduces the run by reusing recorded snapshots and pinned bundles, ensuring the same request_hash/response_hash pairs and equivalent side-effect outcomes. This does not require identical token-by-token model output across providers unless model versions and execution conditions are fully pinned.
Retrieved-context hashes recorded in the replay trace confirm what was retrieved at runtime, not whether the retrieved content was accurate or current. Replay reproduces the agent’s behaviour given the same inputs; it does not validate the inputs. See C18 (Data Quality Gates) for the retrieval-time quality boundary.
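To make the shape of a replay trace concrete, the following is a minimal sketch, not a GATE-defined schema. The class names (`ToolCall`, `ReplayTrace`), the field names, and the choice of SHA-256 over sorted-key JSON are all assumptions made for illustration; only request_hash and response_hash are terms from the text above.

```python
import hashlib
import json
from dataclasses import dataclass, field


def canonical_hash(obj) -> str:
    """SHA-256 over sorted-key, compact JSON so key order does not change the hash."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass
class ToolCall:
    """One recorded tool invocation (hypothetical structure)."""
    tool: str
    request: dict
    response: dict

    @property
    def request_hash(self) -> str:
        return canonical_hash(self.request)

    @property
    def response_hash(self) -> str:
        return canonical_hash(self.response)


@dataclass
class ReplayTrace:
    """Pins every listed source of non-determinism for one run (hypothetical structure)."""
    model_id: str
    model_version: str
    prompt_bundle_hash: str
    tool_schema_hash: str
    temperature: float
    seed: int
    retrieved_context_hashes: list = field(default_factory=list)
    routing_decisions: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)

    def hash_pairs(self) -> list:
        """The (request_hash, response_hash) pairs a replay is expected to reproduce."""
        return [(c.request_hash, c.response_hash) for c in self.tool_calls]


# Example values below are invented for the sketch.
trace = ReplayTrace(
    model_id="example-model",
    model_version="2024-01-01",
    prompt_bundle_hash=canonical_hash({"system": "You are a helper."}),
    tool_schema_hash=canonical_hash({"search": {"q": "string"}}),
    temperature=0.0,
    seed=7,
)
trace.tool_calls.append(ToolCall("search", {"q": "status", "k": 3}, {"hits": []}))
```

Comparing `trace.hash_pairs()` from the original run against the pairs produced during replay is one way to check the scope described above without requiring identical model tokens.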
How
- write trace events as an append-only stream
- snapshot external tool responses (or store pointers to immutable snapshots)
- build a replay harness that stubs tool calls with recorded responses
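The stubbing step can be sketched as follows. This is an illustration under stated assumptions rather than the GATE harness: `ReplayToolStub`, `ReplayDivergence`, and the dict layout of a recorded call are hypothetical names, and lookups are keyed by a SHA-256 hash of the tool name plus request.

```python
import hashlib
import json


def canonical_hash(obj) -> str:
    """SHA-256 over sorted-key, compact JSON."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class ReplayDivergence(Exception):
    """Raised when the replayed run issues a call the trace never recorded."""


class ReplayToolStub:
    """Serves recorded responses instead of calling live tools (hypothetical harness piece)."""

    def __init__(self, recorded_calls):
        # Map each (tool, request) hash to its recorded responses in recording order,
        # so a request repeated during the original run replays in the same sequence.
        self._responses = {}
        for call in recorded_calls:
            key = canonical_hash({"tool": call["tool"], "request": call["request"]})
            self._responses.setdefault(key, []).append(call["response"])

    def call(self, tool, request):
        key = canonical_hash({"tool": tool, "request": request})
        queue = self._responses.get(key)
        if not queue:
            raise ReplayDivergence(f"no recorded response left for {tool} {key[:12]}")
        return queue.pop(0)

    def unused(self) -> int:
        """Recorded responses the replay never consumed; non-zero also signals divergence."""
        return sum(len(q) for q in self._responses.values())


# Example recorded call; the tool name and payloads are invented for the sketch.
stub = ReplayToolStub([
    {"tool": "weather", "request": {"city": "Oslo"}, "response": {"temp_c": 4}},
])
result = stub.call("weather", {"city": "Oslo"})
```

Raising on an unrecorded request, rather than falling through to the live tool, keeps the replay free of live dependencies and makes divergence visible at the exact call where it occurs.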
Architect’s Note - Replay cold start (expired identity/policy)
GATE replay is defined as “no live dependencies,” which includes control-plane dependencies that may change over time. A replay executed months later MUST NOT fail due to expired tokens, rotated keys, or updated policy bundles. The replay harness therefore MUST provide local mocks (or recorded fixtures) for:
- Identity Provider / Attestation verification: return the recorded workload identity claims and attestation status for the run being replayed (verification must succeed against recorded evidence, not current tokens).
- Policy Engine evaluation: replay MUST use the recorded policy_bundle_hash and decision fixtures (or a policy engine loaded with the archived bundle) so decisions reproduce independently of current policy state.
Normative requirement: Replay execution MUST validate authenticity by verifying recorded signatures and hashes (decision records, ledger events, request/response hashes) rather than requiring live token refresh or current IAM state.
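As a rough illustration of verifying recorded evidence offline, the sketch below checks a decision record against its stored hash and signature using a key archived with the trace. The record layout, the function names, and the use of HMAC-SHA256 are assumptions for the example; a real deployment might use asymmetric signatures, and nothing here is prescribed by GATE.

```python
import hashlib
import hmac
import json


def canonical_bytes(obj) -> bytes:
    """Sorted-key, compact JSON encoding used for hashing."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_record(payload: dict, key: bytes) -> dict:
    """Recording-time step: store the payload with its hash and signature."""
    digest = hashlib.sha256(canonical_bytes(payload)).hexdigest()
    signature = hmac.new(key, digest.encode("ascii"), hashlib.sha256).hexdigest()
    return {"payload": payload, "payload_hash": digest, "signature": signature}


def verify_record(record: dict, archived_key: bytes) -> bool:
    """Replay-time step: verify against archived material only.

    No token refresh, expiry check, or identity provider call is made, so the
    result does not depend on current IAM state.
    """
    digest = hashlib.sha256(canonical_bytes(record["payload"])).hexdigest()
    if not hmac.compare_digest(digest, record["payload_hash"]):
        return False  # payload was altered after recording
    expected = hmac.new(archived_key, digest.encode("ascii"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["signature"])


# Example values are invented; the key would come from the archived evidence bundle.
archived_key = b"key-archived-with-the-trace"
decision = sign_record(
    {"policy_bundle_hash": "abc123", "decision": "allow", "tool": "search"},
    archived_key,
)
is_authentic = verify_record(decision, archived_key)
```

Because verification reads only the record and the archived key, the same check gives the same answer months later, after the original tokens have expired and the signing keys have rotated.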
Evidence
- replay success rate
- mean time to reproduce an incident
- regression tests built from incident traces
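The first two evidence items are simple aggregates. A minimal sketch of how they might be computed is shown here; the function names and input shapes (a list of pass/fail booleans, and pairs of opened/reproduced timestamps) are assumptions for illustration.

```python
from datetime import datetime
from statistics import mean


def replay_success_rate(results) -> float:
    """Fraction of replay attempts whose hash pairs matched the recorded run."""
    if not results:
        return 0.0
    return sum(1 for ok in results if ok) / len(results)


def mean_time_to_reproduce(incidents) -> float:
    """Mean seconds from an incident being opened to its first successful replay."""
    durations = [
        (reproduced_at - opened_at).total_seconds()
        for opened_at, reproduced_at in incidents
    ]
    return mean(durations)


# Example data is invented for the sketch.
rate = replay_success_rate([True, True, False, True])
mttr_seconds = mean_time_to_reproduce([
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 30)),
    (datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 10, 30)),
])
```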
Failure modes
- missing tool outputs (replay diverges)
- replay uses live external dependencies
- model/prompt versions not pinned
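Each of these failure modes can be checked before or during a replay. The sketch below is one possible approach under stated assumptions: the trace is a plain dict, the field names (including `response_snapshot_ref`) are hypothetical, and the network guard is deliberately coarse, since it only blocks code that opens sockets through `socket.socket`.

```python
import socket
from contextlib import contextmanager

# Fields that must be pinned for a replay to be meaningful (hypothetical names).
REQUIRED_PINS = ("model_id", "model_version", "prompt_bundle_hash", "tool_schema_hash", "seed")


def preflight(trace: dict) -> list:
    """Return a list of problems that would make the replay diverge."""
    problems = []
    for key in REQUIRED_PINS:
        if trace.get(key) in (None, ""):
            problems.append(f"unpinned: {key}")
    for index, call in enumerate(trace.get("tool_calls", [])):
        # A call needs either an inline response or a pointer to an immutable snapshot.
        if call.get("response") is None and not call.get("response_snapshot_ref"):
            problems.append(f"missing tool output: call {index} ({call.get('tool')})")
    return problems


class LiveDependencyError(RuntimeError):
    """Raised when replayed code tries to reach the network."""


@contextmanager
def no_live_network():
    """Best-effort guard: make socket creation fail while the replay runs."""
    original = socket.socket

    def _blocked(*args, **kwargs):
        raise LiveDependencyError("replay attempted to open a network socket")

    socket.socket = _blocked
    try:
        yield
    finally:
        socket.socket = original  # always restore, even if the replay raised


# Example trace is invented: model_version is unpinned and one tool output is missing.
problems = preflight({
    "model_id": "example-model",
    "model_version": "",
    "prompt_bundle_hash": "abc",
    "tool_schema_hash": "def",
    "seed": 0,
    "tool_calls": [
        {"tool": "search", "response": {"hits": []}},
        {"tool": "fetch", "response": None},
    ],
})
```

Running `preflight` before the replay and wrapping the replay itself in `with no_live_network():` turns silent divergence into an explicit, early failure.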
NIST AI RMF alignment
C10 maps to MEASURE and MANAGE. See the framework paper for the specific subcontrol mappings.
ISO/IEC alignment
C10 maps to ISO/IEC 27001. Typical evidence: see the Evidence section above.