Diff: substrate/soft-slm-tiered-gateway.es

From 1c02ec1 to 1c02ec1

+0 / −0 lines

Before	After
---	---
schema: foundry-doc-v1	schema: foundry-doc-v1
title: "The tiered inference gateway — local-first AI routing"	title: "The tiered inference gateway — local-first AI routing"
slug: soft-slm-tiered-gateway	slug: soft-slm-tiered-gateway
category: substrate	category: substrate
type: topic	type: topic
content_type: topic	content_type: topic
quality: complete	quality: complete
short_description: "A tiered inference gateway that routes AI requests through a local model first, escalating to remote GPU nodes and external APIs only when the local tier cannot serve — minimizing latency, cost, and data exposure while preserving full capability on demand."	short_description: "A tiered inference gateway that routes AI requests through a local model first, escalating to remote GPU nodes and external APIs only when the local tier cannot serve — minimizing latency, cost, and data exposure while preserving full capability on demand."
status: active	status: active
bcsc_class: public-disclosure-safe	bcsc_class: public-disclosure-safe
last_edited: 2026-06-09	last_edited: 2026-06-09
editor: pointsav-engineering	editor: pointsav-engineering
cites: []	cites: []
references: []	references: []
paired_with: soft-slm-tiered-gateway.es.md	paired_with: soft-slm-tiered-gateway.es.md
supersedes: slm-tiered-substrate.md	supersedes: slm-tiered-substrate.md
---	---

# The tiered inference gateway — local-first AI routing	# The tiered inference gateway — local-first AI routing

A tiered inference gateway routes every AI request through a hierarchy of compute	A tiered inference gateway routes every AI request through a hierarchy of compute
tiers, selecting the least expensive capable tier for each request. Routine work runs	tiers, selecting the least expensive capable tier for each request. Routine work runs
on hardware the organization owns. Burst capacity on rented GPU handles work that	on hardware the organization owns. Burst capacity on rented GPU handles work that
exceeds local capability. An external commercial API provides a final fallback. Each	exceeds local capability. An external commercial API provides a final fallback. Each
tier degrades gracefully to the one below it; no tier is a single point of failure.	tier degrades gracefully to the one below it; no tier is a single point of failure.

## Why local-first matters	## Why local-first matters

Routing all inference to an external service is simple to operate but has structural	Routing all inference to an external service is simple to operate but has structural
costs. Every request crosses an organizational boundary, exposing the content of	costs. Every request crosses an organizational boundary, exposing the content of
prompts and responses to a third-party provider. Cost is proportional to usage with	prompts and responses to a third-party provider. Cost is proportional to usage with
no amortization. The organization has no way to adapt the model to its own vocabulary,	no amortization. The organization has no way to adapt the model to its own vocabulary,
processes, or accumulated knowledge.	processes, or accumulated knowledge.

A local-first gateway eliminates these costs for the majority of work. The local model	A local-first gateway eliminates these costs for the majority of work. The local model
handles requests that fall within its capability. External resources handle what it	handles requests that fall within its capability. External resources handle what it
cannot. Over time, the local model improves through usage-derived training, narrowing	cannot. Over time, the local model improves through usage-derived training, narrowing
the set of requests that require external compute.	the set of requests that require external compute.

## The three tiers	## The three tiers

### Tier A — local inference	### Tier A — local inference

Tier A is an inference server running on the organization's own hardware. It is	Tier A is an inference server running on the organization's own hardware. It is
always running, produces responses in seconds, and costs nothing per request beyond	always running, produces responses in seconds, and costs nothing per request beyond
the amortized hardware. It is the default destination for all requests.	the amortized hardware. It is the default destination for all requests.

The local model is purpose-trained or fine-tuned for the organization's domain.	The local model is purpose-trained or fine-tuned for the organization's domain.
It is smaller and faster than models at higher tiers. It answers the majority of	It is smaller and faster than models at higher tiers. It answers the majority of
routine requests competently: summarization, classification, entity extraction from	routine requests competently: summarization, classification, entity extraction from
known document types, code generation in known patterns.	known document types, code generation in known patterns.

Tier A does not handle grammar-constrained structured output well at small model	Tier A does not handle grammar-constrained structured output well at small model
sizes. Requests requiring precise JSON schema enforcement are routed to Tier B.	sizes. Requests requiring precise JSON schema enforcement are routed to Tier B.

### Tier B — burst GPU node	### Tier B — burst GPU node

Tier B is one or more remote inference nodes running a larger, more capable model	Tier B is one or more remote inference nodes running a larger, more capable model
on dedicated GPU hardware. Nodes start on demand and stop when idle, so the cost is	on dedicated GPU hardware. Nodes start on demand and stop when idle, so the cost is
proportional to actual use rather than availability.	proportional to actual use rather than availability.

The gateway maintains a per-node circuit breaker and a VM lifecycle state machine.	The gateway maintains a per-node circuit breaker and a VM lifecycle state machine.
When a Tier B request arrives and the target node is stopped, the gateway starts it	When a Tier B request arrives and the target node is stopped, the gateway starts it
automatically. The caller receives a 202 Accepted response with a polling endpoint	automatically. The caller receives a 202 Accepted response with a polling endpoint
while the node boots. Once the node is ready, the request is served.	while the node boots. Once the node is ready, the request is served.

When the node is running, requests are dispatched immediately. When the circuit	When the node is running, requests are dispatched immediately. When the circuit
breaker opens — after consecutive health probe failures — requests fall back to	breaker opens — after consecutive health probe failures — requests fall back to
Tier A or queue until the node recovers.	Tier A or queue until the node recovers.

Tier B nodes are organized by label. A `batch` label handles background work: corpus	Tier B nodes are organized by label. A `batch` label handles background work: corpus
extraction, training data processing, scheduled maintenance. An `express` label handles	extraction, training data processing, scheduled maintenance. An `express` label handles
time-sensitive work that cannot wait for a cold start.	time-sensitive work that cannot wait for a cold start.

### Tier C — external inference provider	### Tier C — external inference provider

Tier C is an optional connection to a commercial language model API. It serves as a	Tier C is an optional connection to a commercial language model API. It serves as a
final fallback when both Tier A and Tier B are unavailable, and as a direct route for	final fallback when both Tier A and Tier B are unavailable, and as a direct route for
tasks the organization has explicitly designated as external.	tasks the organization has explicitly designated as external.

Tier C is never used as a source of training data. An organization that routes	Tier C is never used as a source of training data. An organization that routes
inference to an external provider and then fine-tunes on those responses is	inference to an external provider and then fine-tunes on those responses is
building a dependency on a third party's output quality and terms of service.	building a dependency on a third party's output quality and terms of service.
The boundary is enforced in the gateway's routing logic.	The boundary is enforced in the gateway's routing logic.

Tier C requires an explicit API key to activate. Without the key, requests that	Tier C requires an explicit API key to activate. Without the key, requests that
reach Tier C fall back to Tier A.	reach Tier C fall back to Tier A.

## Request routing	## Request routing

Every request carries a complexity hint and, optionally, a tier label. The gateway	Every request carries a complexity hint and, optionally, a tier label. The gateway
selects the tier using this decision sequence:	selects the tier using this decision sequence:

1. If a kill switch is closed for the requested tier, the request is rejected or	1. If a kill switch is closed for the requested tier, the request is rejected or
falls to the next tier depending on configuration.	falls to the next tier depending on configuration.
2. If an explicit tier label is present, the request is routed to that tier.	2. If an explicit tier label is present, the request is routed to that tier.
3. If no label is present, the routing policy applies:	3. If no label is present, the routing policy applies:
- `balanced`: low and medium complexity → Tier A; high complexity → Tier B.	- `balanced`: low and medium complexity → Tier A; high complexity → Tier B.
- `drain-batch`: all non-express work routes to the batch node.	- `drain-batch`: all non-express work routes to the batch node.
- `drain-express`: all work routes to the express node to clear a backlog.	- `drain-express`: all work routes to the express node to clear a backlog.
- `local-only`: all work routes to Tier A regardless of complexity.	- `local-only`: all work routes to Tier A regardless of complexity.
4. If the selected tier is unavailable, the request falls to the next tier unless	4. If the selected tier is unavailable, the request falls to the next tier unless
tier affinity is required (for example, structured extraction requires Tier B and	tier affinity is required (for example, structured extraction requires Tier B and
does not fall back to Tier A).	does not fall back to Tier A).

The routing policy is configurable at runtime without restarting the gateway:	The routing policy is configurable at runtime without restarting the gateway:

```	```
POST /v1/flow/policy { "policy": "balanced" }	POST /v1/flow/policy { "policy": "balanced" }
```	```

## The kill switch	## The kill switch

Every tier has an independent kill switch. Closing a kill switch stops all new	Every tier has an independent kill switch. Closing a kill switch stops all new
dispatching to that tier immediately. In-flight requests complete; no new requests	dispatching to that tier immediately. In-flight requests complete; no new requests
start. Queued work accumulates and drains when the kill switch is reopened.	start. Queued work accumulates and drains when the kill switch is reopened.

The kill switch is the operator's billing control. Closing the express node switch	The kill switch is the operator's billing control. Closing the express node switch
stops the A100 from starting; the cost drops to zero. Closing the global switch	stops the A100 from starting; the cost drops to zero. Closing the global switch
stops all Tier B and Tier C spending while allowing Tier A to continue serving.	stops all Tier B and Tier C spending while allowing Tier A to continue serving.

The express lane — which bypasses the file-backed queue for time-sensitive work —	The express lane — which bypasses the file-backed queue for time-sensitive work —
still checks the kill switch. Nothing bypasses the kill switch.	still checks the kill switch. Nothing bypasses the kill switch.

## The priority queue	## The priority queue

Background work — apprenticeship briefs, corpus extraction, training corpus generation	Background work — apprenticeship briefs, corpus extraction, training corpus generation
— is processed through a file-backed priority queue with three levels:	— is processed through a file-backed priority queue with three levels:

- P0 routes exclusively to the local model for lightweight classification.	- P0 routes exclusively to the local model for lightweight classification.
- P1 routes to the batch GPU node for extraction work requiring structured output.	- P1 routes to the batch GPU node for extraction work requiring structured output.
- P2 routes to the batch GPU node for training corpus generation and similar	- P2 routes to the batch GPU node for training corpus generation and similar
long-running background tasks.	long-running background tasks.

The queue drain worker processes one item from each level per cycle, in P0 → P1 → P2	The queue drain worker processes one item from each level per cycle, in P0 → P1 → P2
order, then repeats. This prevents a large batch of P2 work from blocking P1	order, then repeats. This prevents a large batch of P2 work from blocking P1
extraction tasks for an extended period.	extraction tasks for an extended period.

## Organizational memory context	## Organizational memory context

Before dispatching any request to any tier, the gateway queries the organizational	Before dispatching any request to any tier, the gateway queries the organizational
knowledge graph for entities relevant to the current request. Matching entities are	knowledge graph for entities relevant to the current request. Matching entities are
injected into the system prompt as structured context. The model sees the	injected into the system prompt as structured context. The model sees the
organization's known relationships, decisions, and policies without those facts	organization's known relationships, decisions, and policies without those facts
needing to be re-derived from inference.	needing to be re-derived from inference.

This context injection is non-fatal: if the graph service is unavailable, the request	This context injection is non-fatal: if the graph service is unavailable, the request
proceeds without context. A circuit breaker on the graph query path prevents a	proceeds without context. A circuit breaker on the graph query path prevents a
slow graph service from blocking inference.	slow graph service from blocking inference.

## The MCP server	## The MCP server

The gateway exposes an organizational memory interface via the Model Context Protocol	The gateway exposes an organizational memory interface via the Model Context Protocol
at a second port. Any MCP-capable AI client can connect to this interface using its	at a second port. Any MCP-capable AI client can connect to this interface using its
built-in subscription — no separate API key is required. The client's reasoning	built-in subscription — no separate API key is required. The client's reasoning
capability combines with the gateway's organizational knowledge graph to produce	capability combines with the gateway's organizational knowledge graph to produce
responses that are grounded in the organization's actual data.	responses that are grounded in the organization's actual data.

This is the primary path for interactive use by operators who already have a	This is the primary path for interactive use by operators who already have a
subscription to an MCP-capable client. The gateway handles memory; the client	subscription to an MCP-capable client. The gateway handles memory; the client
handles reasoning.	handles reasoning.