Diff: architecture/decode-time-constraints.es

From 5a7d4c7 to 5a7d4c7

+7 / −7 lines

Before	After
---	---
schema: foundry-doc-v1	schema: foundry-doc-v1
title: "Decode-time constraints"	title: "Decode-time constraints"
slug: decode-time-constraints	slug: decode-time-constraints
category: architecture	category: architecture
type: topic	type: topic
quality: complete	quality: complete
short_description: "Decode-time constraints are structural rules applied to a language model's output at each token-emission step, making banned vocabulary or structurally invalid responses mathematically impossible to produce rather than catching them after the fact."	short_description: "Decode-time constraints are structural rules applied to a language model's output at each token-emission step, making banned vocabulary or structurally invalid responses mathematically impossible to produce rather than catching them after the fact."
status: active	status: active
bcsc_class: public-disclosure-safe	bcsc_class: public-disclosure-safe
last_edited: 2026-04-30	last_edited: 2026-04-30
editor: pointsav-engineering	editor: pointsav-engineering
cites:	cites:
- ni-51-102	- ni-51-102
- llguidance	- llguidance
- llm-structured-output-2026	- llm-structured-output-2026
- vllm-multi-lora	- vllm-multi-lora
- xgrammar	- xgrammar
- olmo3-allenai	- olmo3-allenai
paired_with: decode-time-constraints.es.md	paired_with: decode-time-constraints.es.md
---	---


> Decode-time constraints are structural rules applied to a language model's output at each token-emission step, making banned vocabulary or structurally invalid responses mathematically impossible to produce rather than catching them after the fact.	> Decode-time constraints are structural rules applied to a language model's output at each token-emission step, making banned vocabulary or structurally invalid responses mathematically impossible to produce rather than catching them after the fact.

Decode-time constraints are structural rules the ~~PointSav~~ substrate enforces at the moment a language model emits each token, not after the response is finished. When a rule says "no banned-vocabulary words" or "must produce valid JSON", the runtime makes the violating token mathematically impossible — the model picks from the remaining valid tokens. This is the difference between a human grading work after submission and a guard rail that prevents the violation from happening at emission. The constraint takes the form of a context-free grammar (CFG) or finite-state automaton; the runtime computes — token by token — which next-token candidates would still satisfy the grammar, and zeros out the probability of all others.	Decode-time constraints are structural rules the [[pointsav-overview\|PointSav]] substrate enforces at the moment a language model emits each token, not after the response is finished. When a rule says "no banned-vocabulary words" or "must produce valid JSON", the runtime makes the violating token mathematically impossible — the model picks from the remaining valid tokens. This is the difference between a human grading work after submission and a guard rail that prevents the violation from happening at emission. The constraint takes the form of a context-free grammar (CFG) or finite-state automaton; the runtime computes — token by token — which next-token candidates would still satisfy the grammar, and zeros out the probability of all others. See also [[language-protocol-substrate\|the language protocol substrate]] and [[sovereign-ai-routing\|sovereign AI routing]].
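
The masking step described above can be sketched in a few lines of Python. This is a minimal illustration, not the substrate's implementation: the function name `constrained_distribution`, the toy word-level vocabulary, and the pre-computed `allowed` set are all assumptions. A real runtime works on subword tokens and derives the allowed set from the grammar state at each step.

```python
import math


def constrained_distribution(logits, allowed):
    """Return next-token probabilities with grammar-violating tokens masked out.

    `logits` maps candidate tokens to raw scores; `allowed` is the set of
    tokens the grammar still permits at this step (hypothetical inputs).
    """
    # Disallowed tokens get a score of -inf, so exp() maps them to exactly 0.0.
    masked = {
        tok: (score if tok in allowed else -math.inf)
        for tok, score in logits.items()
    }
    top = max(masked.values())
    if top == -math.inf:
        raise ValueError("grammar allows no candidate token")
    # Softmax over the masked scores (shifted by the maximum for stability).
    weights = {tok: math.exp(score - top) for tok, score in masked.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


# Toy example: "robust" has the highest raw score but is not allowed.
logits = {"robust": 3.0, "reliable": 2.0, "stable": 1.0}
probs = constrained_distribution(logits, allowed={"reliable", "stable"})
```

Because the masked token's probability is exactly zero, no sampling strategy (greedy, top-k, or temperature sampling) can select it; the remaining probability mass is renormalised over the valid tokens.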

This technique is called constrained decoding, structured generation, or grammar-guided generation. Implementations include Microsoft Research's `[llguidance]` library, Carnegie Mellon's `[xgrammar]`, vLLM's structured outputs `[vllm-multi-lora]`, and a growing body of literature on `[llm-structured-output-2026]`.	This technique is called constrained decoding, structured generation, or grammar-guided generation. Implementations include Microsoft Research's `[llguidance]` library, Carnegie Mellon's `[xgrammar]`, vLLM's structured outputs `[vllm-multi-lora]`, and a growing body of literature on `[llm-structured-output-2026]`.

## Overview	## Overview

The mental model for a content session is simple: a `.lark` grammar file defines what a valid response looks like, and the runtime makes invalid tokens unreachable. There is no "but what if the model emits a banned word" case, because the banned word cannot be sampled.	The mental model for a content session is simple: a `.lark` grammar file defines what a valid response looks like, and the runtime makes invalid tokens unreachable. There is no "but what if the model emits a banned word" case, because the banned word cannot be sampled.

The substrate ships ~~`service-content/schemas/banned-vocab.lark`~~ — a Lark EBNF grammar declaring eight banned editorial terms (`leverage`, `empower`, `next-generation`, `industry-leading`, `seamless`, `robust`, `cutting-edge`, `world-class`) plus a backtick-quoted-escape rule. The grammar's top-level rule `response` allows any token that is not one of the eight banned forms (case-insensitive); backtick-quoted segments are exempt so that documents can quote a banned term without violating the rule.	The substrate ships [[service-content\|`service-content/schemas/banned-vocab.lark`]] — a Lark EBNF grammar declaring eight banned editorial terms (`leverage`, `empower`, `next-generation`, `industry-leading`, `seamless`, `robust`, `cutting-edge`, `world-class`) plus a backtick-quoted-escape rule. The grammar's top-level rule `response` allows any token that is not one of the eight banned forms (case-insensitive); backtick-quoted segments are exempt so that documents can quote a banned term without violating the rule.
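
The rule described above (eight case-insensitive banned terms, with backtick-quoted segments exempt) can be approximated offline with a regular expression. The sketch below is not the contents of `banned-vocab.lark`, and it does not use Lark. The function name `find_banned` and the whole-word matching behaviour are assumptions made for illustration, since the source does not say how the grammar treats inflected forms.

```python
import re

# The eight terms listed in the source document.
BANNED = [
    "leverage", "empower", "next-generation", "industry-leading",
    "seamless", "robust", "cutting-edge", "world-class",
]

# Backtick-quoted segments are exempt from the rule.
_BACKTICK = re.compile(r"`[^`]*`")

# Assumption: match whole terms only, not substrings of longer words.
_BANNED_RE = re.compile(
    r"(?<![\w-])(" + "|".join(re.escape(t) for t in BANNED) + r")(?![\w-])",
    re.IGNORECASE,
)


def find_banned(text):
    """Return (offset, term) pairs for banned terms outside backtick quotes."""
    # Blank out quoted segments with spaces so offsets still refer to `text`.
    unquoted = _BACKTICK.sub(lambda m: " " * len(m.group(0)), text)
    return [(m.start(), m.group(1).lower()) for m in _BANNED_RE.finditer(unquoted)]
```

A check of this kind only reports violations after the fact; it is comparable to the offline validation path, not to decode-time enforcement, which prevents the term from being emitted at all.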

## How It Works	## How It Works

Production inference at Tier A (local OLMo 3 7B per `[olmo3-allenai]`) and Tier B ~~(Yo-Yo~~ ~~bursting)~~ loads the grammar via `[llguidance]` and applies it at decode time. Editorial-grade workspace validation (`validate.py`) runs the same grammar in Lark mode for offline checks before content ships.	Production inference at Tier A (local OLMo 3 7B per `[olmo3-allenai]`) and Tier B ([[yoyo-compute-substrate\|Yo-Yo bursting]]) loads the grammar via `[llguidance]` and applies it at decode time. Editorial-grade workspace validation (`validate.py`) runs the same grammar in Lark mode for offline checks before content ships.

The pattern composes with the ~~language-protocol-substrate:~~ each genre template (TOPIC, GUIDE, README, contract, policy, and the rest) ships a per-genre grammar fragment. At inference time, the active grammar is `base-grammar ⊕ tenant-grammar ⊕ genre-grammar` — substrate-tier rules combined with tenant-tier customisations combined with the request's genre.	The pattern composes with the [[language-protocol-substrate\|language-protocol substrate]]: each genre template (TOPIC, GUIDE, README, contract, policy, and the rest) ships a per-genre grammar fragment. At inference time, the active grammar is `base-grammar ⊕ tenant-grammar ⊕ genre-grammar` — substrate-tier rules combined with tenant-tier customisations combined with the request's genre.

## Architecture	## Architecture

The constraint system is layered:	The constraint system is layered:

1. Base grammar — universal banned-vocabulary rules applying to every tenant and every genre.	1. Base grammar — universal banned-vocabulary rules applying to every tenant and every genre.
2. Tenant grammar — per-customer extensions (brand-specific Do-Not-Use words, required citation density rules, prohibited claim patterns). Authored locally by the tenant and loaded by the Doorman.	2. Tenant grammar — per-customer extensions (brand-specific Do-Not-Use words, required citation density rules, prohibited claim patterns). Authored locally by the tenant and loaded by the Doorman.
3. Genre grammar — per-genre structural rules (a TOPIC must have a lead paragraph; a GUIDE must have numbered steps; a regulatory disclosure must carry specific citation fields).	3. Genre grammar — per-genre structural rules (a TOPIC must have a lead paragraph; a GUIDE must have numbered steps; a regulatory disclosure must carry specific citation fields).

At request time, the ~~Doorman~~ ~~(`service-slm`)~~ composes the three grammar layers, loads the result into the inference runtime, and runs decoding with the composed constraint active.	At request time, the [[doorman-protocol\|Doorman]] ([[service-slm]]) composes the three grammar layers, loads the result into the inference runtime, and runs decoding with the composed constraint active.

## Applications	## Applications

The editorial path becomes structurally auditable:	The editorial path becomes structurally auditable:

- A TOPIC committed to a content-wiki repo cannot contain a banned-vocab term — the grammar refused to emit one.	- A TOPIC committed to a content-wiki repo cannot contain a banned-vocab term — the grammar refused to emit one.
- A GUIDE rendered for a customer cannot contain forbidden tenant-specific terms — that tenant's grammar forbade them.	- A GUIDE rendered for a customer cannot contain forbidden tenant-specific terms — that tenant's grammar forbade them.
- A regulatory disclosure draft cannot omit a required citation pattern — the grammar required it.	- A regulatory disclosure draft cannot omit a required citation pattern — the grammar required it.

The discipline shifts from human-grading-after-submission to runtime-impossibility-at-emission. This is the substrate enforcement layer the ~~Compounding~~ ~~Substrate's~~ federated-compounding property depends on; without it, federated training would propagate banned-vocabulary contamination from any tenant's training data into the next year's base model.	The discipline shifts from human-grading-after-submission to runtime-impossibility-at-emission. This is the substrate enforcement layer the [[compounding-substrate\|Compounding Substrate]]'s federated-compounding property depends on; without it, federated training would propagate banned-vocabulary contamination from any tenant's training data into the next year's base model.

## Limitations	## Limitations

Three structural reasons hyperscaler-managed AI cannot match this approach:	Three structural reasons hyperscaler-managed AI cannot match this approach:

1. The grammar must be authored locally. A constraint that lives at decode time runs inside the inference loop. To author a grammar specific to a tenant's editorial standards, the tenant needs write access to the grammar file the runtime loads. Hyperscaler-managed AI products treat the grammar as part of the closed model deployment — tenants get structured-output modes, not a tenant-specific grammar that loads at inference time.	1. The grammar must be authored locally. A constraint that lives at decode time runs inside the inference loop. To author a grammar specific to a tenant's editorial standards, the tenant needs write access to the grammar file the runtime loads. Hyperscaler-managed AI products treat the grammar as part of the closed model deployment — tenants get structured-output modes, not a tenant-specific grammar that loads at inference time.

2. The constraint must compose with adapter routing. The platform's Doorman routes among three compute tiers and composes adapters per request. Decode-time constraints must travel with the adapter composition. Hyperscaler-managed AI does not expose adapter composition primitives, let alone constraint composition.	2. The constraint must compose with adapter routing. The platform's Doorman routes among three compute tiers and composes adapters per request. Decode-time constraints must travel with the adapter composition. Hyperscaler-managed AI does not expose adapter composition primitives, let alone constraint composition.

3. The constraint must be auditable. Per `[ni-51-102]` continuous-disclosure language, every editorial output must be traceable to the rules it was generated under. The per-tenant audit ledger captures the grammar version, the adapter composition, and the response — together. Hyperscaler-managed AI offers neither the grammar version nor the adapter composition for inspection.	3. The constraint must be auditable. Per `[ni-51-102]` continuous-disclosure language, every editorial output must be traceable to the rules it was generated under. The per-tenant audit ledger captures the grammar version, the adapter composition, and the response — together. Hyperscaler-managed AI offers neither the grammar version nor the adapter composition for inspection.

## Forward-Looking	## Forward-Looking

Per `[ni-51-102]` continuous-disclosure language, the trajectory below is `planned` and `intended`:	Per `[ni-51-102]` continuous-disclosure language, the trajectory below is `planned` and `intended`:

- Per-genre grammars for the 16 genre templates currently in `service-disclosure/templates/` (Phase 1B grammar covers the universal banned-vocab; per-genre grammars are subsequent work).	- Per-genre grammars for the 16 genre templates currently in `service-disclosure/templates/` (Phase 1B grammar covers the universal banned-vocab; per-genre grammars are subsequent work).
- Per-tenant banned-vocab extensions (for example, a customer's brand-specific Do-Not-Use words).	- Per-tenant banned-vocab extensions (for example, a customer's brand-specific Do-Not-Use words).
- Live ~~adapter~~ ~~composition~~ with grammar composition through ~~`service-slm`'s~~ ~~Doorman.~~	- Live [[adapter-composition\|adapter composition]] with grammar composition through [[service-slm]]'s [[doorman-protocol\|Doorman]].
- Audit-ledger entries recording `grammar_version + adapter_composition + response_hash` per request.	- Audit-ledger entries recording `grammar_version + adapter_composition + response_hash` per request.
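
The planned audit-ledger entry in the last item could take a shape like the following. This is a sketch under stated assumptions: the source gives only the three field names, so the JSON encoding, the use of SHA-256 for `response_hash`, the function name `audit_entry`, and the example version and adapter strings are all hypothetical.

```python
import hashlib
import json


def audit_entry(grammar_version, adapter_composition, response_text):
    """Build one ledger record tying a response to the rules it was generated under."""
    # Assumption: the response is identified by the SHA-256 of its UTF-8 bytes.
    response_hash = hashlib.sha256(response_text.encode("utf-8")).hexdigest()
    entry = {
        "grammar_version": grammar_version,
        "adapter_composition": list(adapter_composition),
        "response_hash": response_hash,
    }
    # Sorted keys give a stable serialisation for identical inputs.
    return json.dumps(entry, sort_keys=True)


# Hypothetical values, for illustration only.
record = audit_entry(
    grammar_version="banned-vocab@example",
    adapter_composition=["base", "tenant-example", "genre-topic"],
    response_text="Decode-time constraints are structural rules.",
)
```

Storing a hash instead of the full response keeps the entry small while still letting an auditor confirm that a given response matches the recorded grammar version and adapter composition.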

## See also	## See also

- [[compounding-substrate]]	- [[compounding-substrate]]
- [[language-protocol-substrate]]	- [[language-protocol-substrate]]
- [[apprenticeship-substrate]]	- [[apprenticeship-substrate]]
- [[sovereign-ai-routing]]	- [[sovereign-ai-routing]]
- [[worm-ledger-architecture]]	- [[worm-ledger-architecture]]