C08 Layer 2 - Runtime Enforcement

Prompt and Content Injection Defense

Structural separation of instruction and data channels at the gateway boundary. Schema validation, guard-model scanning at higher tiers, quarantine of suspect inputs.

Why

Injection attacks cause the model to treat untrusted content as instructions.
Indirect injection (retrieved docs) is especially dangerous because it arrives “inside the context.”

What

A layered defense that: classifies content sources (trusted vs untrusted)
normalizes/strips untrusted markup and instruction-like patterns
enforces instruction hierarchy separation
escalates high-risk content to additional verification or HITL

How

strict input contracts: tools accept structured JSON, not free text
isolate system instructions from untrusted content in separate channels/fields
apply content normalization for HTML/PDF/email sources before model ingestion
optional guard scanning for known exploit patterns; throttle probing behavior

Evidence

detection metrics (block rate, false positives)
exploit success rate from adversarial test suite (C15)
provenance logs for retrieved sources

Failure modes

feeding raw HTML/PDF into the model
allowing untrusted content to modify system prompt
assuming internal documents cannot be malicious
no regression tests for injection scenarios

Treating C08 as a quality gate for retrieved content. C08 defends against adversarial inputs and instruction injection, not against stale, low-confidence, or unverified information. See the Memory flow scope note in the Reference Architecture and C18 (Data Quality Gates) for the boundary that covers content quality and freshness.

NIST AI RMF alignment

C08 maps to MANAGE and MEASURE. See the framework paper for the specific subcontrol mappings.

ISO/IEC alignment

C08 maps to ISO/IEC 27001. Typical evidence: see the Evidence section above.