Diff: architecture/decode-time-constraints
From b069558 to b069558
+0 / −0 lines
| Before | After |
|---|---|
| --- | --- |
| schema: foundry-doc-v1 | schema: foundry-doc-v1 |
| title: "Decode-time constraints" | title: "Decode-time constraints" |
| slug: decode-time-constraints | slug: decode-time-constraints |
| category: architecture | category: architecture |
| type: topic | type: topic |
| quality: complete | quality: complete |
| short_description: "Decode-time constraints are structural rules applied to a language model's output at each token-emission step, making banned vocabulary or structurally invalid responses mathematically impossible to produce rather than catching them after the fact." | short_description: "Decode-time constraints are structural rules applied to a language model's output at each token-emission step, making banned vocabulary or structurally invalid responses mathematically impossible to produce rather than catching them after the fact." |
| status: active | status: active |
| bcsc_class: public-disclosure-safe | bcsc_class: public-disclosure-safe |
| last_edited: 2026-04-30 | last_edited: 2026-04-30 |
| editor: pointsav-engineering | editor: pointsav-engineering |
| cites: | cites: |
| - ni-51-102 | - ni-51-102 |
| - llguidance | - llguidance |
| - llm-structured-output-2026 | - llm-structured-output-2026 |
| - vllm-multi-lora | - vllm-multi-lora |
| - xgrammar | - xgrammar |
| - olmo3-allenai | - olmo3-allenai |
| paired_with: decode-time-constraints.es.md | paired_with: decode-time-constraints.es.md |
| --- | --- |
| > Decode-time constraints are structural rules applied to a language model's output at each token-emission step, making banned vocabulary or structurally invalid responses mathematically impossible to produce rather than catching them after the fact. | > Decode-time constraints are structural rules applied to a language model's output at each token-emission step, making banned vocabulary or structurally invalid responses mathematically impossible to produce rather than catching them after the fact. |
| **Decode-time constraints** are structural rules the [[pointsav-overview|PointSav]] substrate enforces at the moment a language model emits each token, not after the response is finished. When a rule says "no banned-vocabulary words" or "must produce valid JSON", the runtime makes the violating token mathematically impossible — the model picks from the remaining valid tokens. This is the difference between a human grading work after submission and a guard rail that prevents the violation from happening at emission. The constraint takes the form of a context-free grammar (CFG) or finite-state automaton; the runtime computes — token by token — which next-token candidates would still satisfy the grammar, and zeros out the probability of all others. See also [[language-protocol-substrate|the language protocol substrate]] and [[sovereign-ai-routing|sovereign AI routing]]. | **Decode-time constraints** are structural rules the [[pointsav-overview|PointSav]] substrate enforces at the moment a language model emits each token, not after the response is finished. When a rule says "no banned-vocabulary words" or "must produce valid JSON", the runtime makes the violating token mathematically impossible — the model picks from the remaining valid tokens. This is the difference between a human grading work after submission and a guard rail that prevents the violation from happening at emission. The constraint takes the form of a context-free grammar (CFG) or finite-state automaton; the runtime computes — token by token — which next-token candidates would still satisfy the grammar, and zeros out the probability of all others. See also [[language-protocol-substrate|the language protocol substrate]] and [[sovereign-ai-routing|sovereign AI routing]]. |
| This technique is called constrained decoding, structured generation, or grammar-guided generation. Implementations include Microsoft Research's `[llguidance]` library, Carnegie Mellon's `[xgrammar]`, and vLLM's structured outputs `[vllm-multi-lora]`; a growing body of literature covers structured LLM output `[llm-structured-output-2026]`. | This technique is called constrained decoding, structured generation, or grammar-guided generation. Implementations include Microsoft Research's `[llguidance]` library, Carnegie Mellon's `[xgrammar]`, and vLLM's structured outputs `[vllm-multi-lora]`; a growing body of literature covers structured LLM output `[llm-structured-output-2026]`. |
| ## Overview | ## Overview |
| The mental model for a content session is simple: a `.lark` grammar file states what a valid response looks like, and the runtime makes invalid tokens unreachable. The question "what if the model emits a banned word" does not arise, because the banned word cannot be sampled. | The mental model for a content session is simple: a `.lark` grammar file states what a valid response looks like, and the runtime makes invalid tokens unreachable. The question "what if the model emits a banned word" does not arise, because the banned word cannot be sampled. |
| The substrate ships [[service-content|`service-content/schemas/banned-vocab.lark`]] — a Lark EBNF grammar declaring eight banned editorial terms (`leverage`, `empower`, `next-generation`, `industry-leading`, `seamless`, `robust`, `cutting-edge`, `world-class`) plus a backtick-quoted-escape rule. The grammar's top-level rule `response` allows any token that is not one of the eight banned forms (case-insensitive); backtick-quoted segments are exempt so that documents can quote a banned term without violating the rule. | The substrate ships [[service-content|`service-content/schemas/banned-vocab.lark`]] — a Lark EBNF grammar declaring eight banned editorial terms (`leverage`, `empower`, `next-generation`, `industry-leading`, `seamless`, `robust`, `cutting-edge`, `world-class`) plus a backtick-quoted-escape rule. The grammar's top-level rule `response` allows any token that is not one of the eight banned forms (case-insensitive); backtick-quoted segments are exempt so that documents can quote a banned term without violating the rule. |
| ## How It Works | ## How It Works |
| Production inference at Tier A (local OLMo 3 7B per `[olmo3-allenai]`) and Tier B ([[yoyo-compute-substrate|Yo-Yo bursting]]) loads the grammar via `[llguidance]` and applies it at decode time. Editorial-grade workspace validation (`validate.py`) runs the same grammar in Lark mode for offline checks before content ships. | Production inference at Tier A (local OLMo 3 7B per `[olmo3-allenai]`) and Tier B ([[yoyo-compute-substrate|Yo-Yo bursting]]) loads the grammar via `[llguidance]` and applies it at decode time. Editorial-grade workspace validation (`validate.py`) runs the same grammar in Lark mode for offline checks before content ships. |
| The pattern composes with the [[language-protocol-substrate|language-protocol substrate]]: each genre template (TOPIC, GUIDE, README, contract, policy, and the rest) ships a per-genre grammar fragment. At inference time, the active grammar is `base-grammar ⊕ tenant-grammar ⊕ genre-grammar` — substrate-tier rules combined with tenant-tier customisations combined with the request's genre. | The pattern composes with the [[language-protocol-substrate|language-protocol substrate]]: each genre template (TOPIC, GUIDE, README, contract, policy, and the rest) ships a per-genre grammar fragment. At inference time, the active grammar is `base-grammar ⊕ tenant-grammar ⊕ genre-grammar` — substrate-tier rules combined with tenant-tier customisations combined with the request's genre. |
| ## Architecture | ## Architecture |
| The constraint system is layered: | The constraint system is layered: |
| 1. **Base grammar** — universal banned-vocabulary rules applying to every tenant and every genre. | 1. **Base grammar** — universal banned-vocabulary rules applying to every tenant and every genre. |
| 2. **Tenant grammar** — per-customer extensions (brand-specific Do-Not-Use words, required citation density rules, prohibited claim patterns). Authored locally by the tenant and loaded by the Doorman. | 2. **Tenant grammar** — per-customer extensions (brand-specific Do-Not-Use words, required citation density rules, prohibited claim patterns). Authored locally by the tenant and loaded by the Doorman. |
| 3. **Genre grammar** — per-genre structural rules (a TOPIC must have a lead paragraph; a GUIDE must have numbered steps; a regulatory disclosure must carry specific citation fields). | 3. **Genre grammar** — per-genre structural rules (a TOPIC must have a lead paragraph; a GUIDE must have numbered steps; a regulatory disclosure must carry specific citation fields). |
| At request time, the [[doorman-protocol|Doorman]] ([[service-slm]]) composes the three grammar layers, loads the result into the inference runtime, and runs decoding with the composed constraint active. | At request time, the [[doorman-protocol|Doorman]] ([[service-slm]]) composes the three grammar layers, loads the result into the inference runtime, and runs decoding with the composed constraint active. |
| ## Applications | ## Applications |
| The editorial path becomes structurally auditable: | The editorial path becomes structurally auditable: |
| - A TOPIC committed to a content-wiki repo cannot contain a banned-vocab term: under the grammar, the runtime could not emit one. | - A TOPIC committed to a content-wiki repo cannot contain a banned-vocab term: under the grammar, the runtime could not emit one. |
| - A GUIDE rendered for a customer cannot contain forbidden tenant-specific terms: that tenant's grammar ruled them out. | - A GUIDE rendered for a customer cannot contain forbidden tenant-specific terms: that tenant's grammar ruled them out. |
| - A regulatory disclosure draft cannot omit a required citation pattern: the grammar required it. | - A regulatory disclosure draft cannot omit a required citation pattern: the grammar required it. |
| The discipline shifts from human-grading-after-submission to runtime-impossibility-at-emission. This is the substrate enforcement layer the [[compounding-substrate|Compounding Substrate]]'s federated-compounding property depends on; without it, federated training would propagate banned-vocabulary contamination from any tenant's training data into the next year's base model. | The discipline shifts from human-grading-after-submission to runtime-impossibility-at-emission. This is the substrate enforcement layer the [[compounding-substrate|Compounding Substrate]]'s federated-compounding property depends on; without it, federated training would propagate banned-vocabulary contamination from any tenant's training data into the next year's base model. |
| ## Limitations | ## Limitations |
| Three structural reasons hyperscaler-managed AI cannot match this approach: | Three structural reasons hyperscaler-managed AI cannot match this approach: |
| **1. The grammar must be authored locally.** A constraint that lives at decode time runs inside the inference loop. To author a grammar specific to a tenant's editorial standards, the tenant needs write access to the grammar file the runtime loads. Hyperscaler-managed AI products treat the grammar as part of the closed model deployment — tenants get structured-output modes, not a tenant-specific grammar that loads at inference time. | **1. The grammar must be authored locally.** A constraint that lives at decode time runs inside the inference loop. To author a grammar specific to a tenant's editorial standards, the tenant needs write access to the grammar file the runtime loads. Hyperscaler-managed AI products treat the grammar as part of the closed model deployment — tenants get structured-output modes, not a tenant-specific grammar that loads at inference time. |
| **2. The constraint must compose with adapter routing.** The platform's Doorman routes among three compute tiers and composes adapters per request. Decode-time constraints must travel with the adapter composition. Hyperscaler-managed AI does not expose adapter composition primitives, let alone constraint composition. | **2. The constraint must compose with adapter routing.** The platform's Doorman routes among three compute tiers and composes adapters per request. Decode-time constraints must travel with the adapter composition. Hyperscaler-managed AI does not expose adapter composition primitives, let alone constraint composition. |
| **3. The constraint must be auditable.** Per `[ni-51-102]` continuous-disclosure language, every editorial output must be traceable to the rules it was generated under. The per-tenant audit ledger captures the grammar version, the adapter composition, and the response — together. Hyperscaler-managed AI offers neither the grammar version nor the adapter composition for inspection. | **3. The constraint must be auditable.** Per `[ni-51-102]` continuous-disclosure language, every editorial output must be traceable to the rules it was generated under. The per-tenant audit ledger captures the grammar version, the adapter composition, and the response — together. Hyperscaler-managed AI offers neither the grammar version nor the adapter composition for inspection. |
| ## Forward-Looking | ## Forward-Looking |
| Per `[ni-51-102]` continuous-disclosure language, the trajectory below is `planned` and `intended`: | Per `[ni-51-102]` continuous-disclosure language, the trajectory below is `planned` and `intended`: |
| - Per-genre grammars for the 16 genre templates currently in `service-disclosure/templates/` (Phase 1B grammar covers the universal banned-vocab; per-genre grammars are subsequent work). | - Per-genre grammars for the 16 genre templates currently in `service-disclosure/templates/` (Phase 1B grammar covers the universal banned-vocab; per-genre grammars are subsequent work). |
| - Per-tenant banned-vocab extensions (for example, a customer's brand-specific Do-Not-Use words). | - Per-tenant banned-vocab extensions (for example, a customer's brand-specific Do-Not-Use words). |
| - Live [[adapter-composition|adapter composition]] with grammar composition through [[service-slm]]'s [[doorman-protocol|Doorman]]. | - Live [[adapter-composition|adapter composition]] with grammar composition through [[service-slm]]'s [[doorman-protocol|Doorman]]. |
| - Audit-ledger entries recording `grammar_version + adapter_composition + response_hash` per request. | - Audit-ledger entries recording `grammar_version + adapter_composition + response_hash` per request. |
| ## See also | ## See also |
| - [[compounding-substrate]] | - [[compounding-substrate]] |
| - [[language-protocol-substrate]] | - [[language-protocol-substrate]] |
| - [[apprenticeship-substrate]] | - [[apprenticeship-substrate]] |
| - [[sovereign-ai-routing]] | - [[sovereign-ai-routing]] |
| - [[worm-ledger-architecture]] | - [[worm-ledger-architecture]] |
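
The token-masking step described in the lead paragraph ("zeros out the probability of all others") can be illustrated with a small, self-contained sketch. It is not the substrate's implementation and does not use `llguidance` or the shipped `.lark` file: the toy vocabulary, the two-term banned list, and the helper names `allowed_mask` and `constrained_probs` are all assumptions made for illustration. It also uses a plain substring check rather than a compiled grammar, so it would over-block words that merely contain a banned term and it ignores the backtick-quoted exemption.

```python
import math

# Hypothetical toy vocabulary and banned list (the shipped grammar declares eight terms).
VOCAB = ["the", " ", "seam", "less", "robust", "design"]
BANNED = {"seamless", "robust"}


def allowed_mask(text_so_far, vocab=VOCAB, banned=BANNED):
    """Mark each token True if appending it would not complete a banned term."""
    mask = []
    for token in vocab:
        candidate = (text_so_far + token).lower()  # case-insensitive check
        mask.append(not any(term in candidate for term in banned))
    return mask


def constrained_probs(logits, mask):
    """Softmax over the logits with every disallowed token forced to probability 0."""
    if not any(mask):
        raise ValueError("constraint leaves no valid next token")
    masked = [x if ok else float("-inf") for x, ok in zip(logits, mask)]
    peak = max(masked)  # finite, because at least one token is allowed
    weights = [math.exp(x - peak) for x in masked]  # exp(-inf) == 0.0
    total = sum(weights)
    return [w / total for w in weights]


# The model strongly prefers "less" after "a seam", but the mask removes it.
logits = [0.5, 0.1, 0.2, 4.0, 3.0, 1.0]
probs = constrained_probs(logits, allowed_mask("a seam"))
for token, p in zip(VOCAB, probs):
    print(f"{token!r}: {p:.3f}")
```

Sampling from `probs` can never yield `less` or `robust` in this context, however high their raw logits are, because their probability is exactly zero and the remaining mass is renormalised over the valid tokens. A production runtime would derive the mask from the compiled grammar's parser state rather than from string matching.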