Diff: substrate/soft-slm-tiered-gateway.es

From 1c02ec1 to 1c02ec1

+0 / −0 lines
---
schema: foundry-doc-v1
title: "The tiered inference gateway — local-first AI routing"
slug: soft-slm-tiered-gateway
category: substrate
type: topic
content_type: topic
quality: complete
short_description: "A tiered inference gateway that routes AI requests through a local model first, escalating to remote GPU nodes and external APIs only when the local tier cannot serve — minimizing latency, cost, and data exposure while preserving full capability on demand."
status: active
bcsc_class: public-disclosure-safe
last_edited: 2026-06-09
editor: pointsav-engineering
cites: []
references: []
paired_with: soft-slm-tiered-gateway.es.md
supersedes: slm-tiered-substrate.md
---

# The tiered inference gateway — local-first AI routing

A tiered inference gateway routes every AI request through a hierarchy of compute
tiers, selecting the least expensive capable tier for each request. Routine work runs
on hardware the organization owns. Burst capacity on rented GPU nodes handles work that
exceeds local capability. An external commercial API provides a final fallback. Each
tier degrades gracefully to the one below it; no tier is a single point of failure.

## Why local-first matters

Routing all inference to an external service is simple to operate but has structural
costs. Every request crosses an organizational boundary, exposing the content of
prompts and responses to a third-party provider. Cost is proportional to usage with
no amortization. The organization has no way to adapt the model to its own vocabulary,
processes, or accumulated knowledge.

A local-first gateway eliminates these costs for the majority of work. The local model
handles requests that fall within its capability. External resources handle what it
cannot. Over time, the local model improves through usage-derived training, narrowing
the set of requests that require external compute.

## The three tiers

### Tier A — local inference

Tier A is an inference server running on the organization's own hardware. It is
always running, produces responses in seconds, and costs nothing per request beyond
the amortized hardware. It is the default destination for all requests.

The local model is purpose-trained or fine-tuned for the organization's domain.
It is smaller and faster than models at higher tiers. It answers the majority of
routine requests competently: summarization, classification, entity extraction from
known document types, code generation in known patterns.

Tier A does not handle grammar-constrained structured output well at small model
sizes. Requests requiring precise JSON schema enforcement are routed to Tier B.

### Tier B — burst GPU node

Tier B is one or more remote inference nodes running a larger, more capable model
on dedicated GPU hardware. Nodes start on demand and stop when idle, so the cost is
proportional to actual use rather than availability.

The gateway maintains a per-node circuit breaker and a VM lifecycle state machine.
When a Tier B request arrives and the target node is stopped, the gateway starts it
automatically. The caller receives a 202 Accepted response with a polling endpoint
while the node boots. Once the node is ready, the request is served.
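
The cold-start path above can be sketched roughly as follows. This is an illustration under stated assumptions, not the gateway's code: the state names (`STOPPED`, `STARTING`, `READY`), the `BurstNode` class, and the `/v1/jobs/<id>` polling path are all hypothetical, since the source names none of them.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeState(Enum):
    # Hypothetical lifecycle states; the source only says "stopped" and "ready".
    STOPPED = "stopped"
    STARTING = "starting"
    READY = "ready"


@dataclass
class BurstNode:
    """Hypothetical Tier B node that starts on demand."""
    label: str
    state: NodeState = NodeState.STOPPED
    pending: list = field(default_factory=list)
    next_id: int = 1

    def submit(self, prompt: str) -> dict:
        """Return an HTTP-style response for a Tier B request."""
        if self.state is NodeState.READY:
            return {"status": 200, "body": self._infer(prompt)}
        if self.state is NodeState.STOPPED:
            # A real gateway would ask the VM provider to boot the node here.
            self.state = NodeState.STARTING
        job_id = self.next_id
        self.next_id += 1
        self.pending.append((job_id, prompt))
        # 202 Accepted plus an assumed polling path while the node boots.
        return {"status": 202, "poll": f"/v1/jobs/{job_id}"}

    def mark_ready(self) -> dict:
        """Called once the node is ready; serves the requests held during boot."""
        self.state = NodeState.READY
        results = {job_id: self._infer(p) for job_id, p in self.pending}
        self.pending.clear()
        return results

    def _infer(self, prompt: str) -> str:
        # Stand-in for the real model call.
        return f"[{self.label}] {prompt}"
```

A caller that receives status 202 would poll the returned path until the held request has been served; requests arriving after `mark_ready()` are answered directly.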

When the node is running, requests are dispatched immediately. When the circuit
breaker opens, after consecutive health probe failures, requests fall back to
Tier A or queue until the node recovers.
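
A minimal sketch of the per-node breaker behaviour described above. The failure `threshold`, the class and function names, and the rule that a single passing probe closes the breaker again are assumptions; the source states only that the breaker opens after consecutive health probe failures.

```python
class CircuitBreaker:
    """Opens after `threshold` consecutive failed health probes (assumed parameter)."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def record_probe(self, healthy: bool) -> None:
        # Assumption: one healthy probe resets the count and closes the breaker.
        self.failures = 0 if healthy else self.failures + 1

    @property
    def is_open(self) -> bool:
        return self.failures >= self.threshold


def dispatch_tier_b(breaker: CircuitBreaker, can_fall_back: bool) -> str:
    """Decide where a Tier B request goes, given the node's breaker state."""
    if not breaker.is_open:
        return "tier-b"
    # Breaker open: fall back to Tier A when allowed, otherwise wait in the queue.
    return "tier-a" if can_fall_back else "queued"
```

Here `can_fall_back` stands in for whatever rule decides between the two outcomes the text mentions, falling back to Tier A or queueing until the node recovers.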

Tier B nodes are organized by label. A `batch` label handles background work: corpus
extraction, training data processing, scheduled maintenance. An `express` label handles
time-sensitive work that cannot wait for a cold start.

### Tier C — external inference provider

Tier C is an optional connection to a commercial language model API. It serves as a
final fallback when both Tier A and Tier B are unavailable, and as a direct route for
tasks the organization has explicitly designated as external.

Tier C is never used as a source of training data. An organization that routes
inference to an external provider and then fine-tunes on those responses is
building a dependency on a third party's output quality and terms of service.
The boundary is enforced in the gateway's routing logic.

Tier C requires an explicit API key to activate. Without the key, requests that
reach Tier C fall back to Tier A.

## Request routing

Every request carries a complexity hint and, optionally, a tier label. The gateway
selects the tier using this decision sequence:

1. If a kill switch is closed for the requested tier, the request is rejected or
   falls to the next tier depending on configuration.
2. If an explicit tier label is present, the request is routed to that tier.
3. If no label is present, the routing policy applies:
   - `balanced`: low and medium complexity → Tier A; high complexity → Tier B.
   - `drain-batch`: all non-express work routes to the batch node.
   - `drain-express`: all work routes to the express node to clear a backlog.
   - `local-only`: all work routes to Tier A regardless of complexity.
4. If the selected tier is unavailable, the request falls to the next tier unless
   tier affinity is required (for example, structured extraction requires Tier B and
   does not fall back to Tier A).
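
The decision sequence above can be sketched as a single function. Several things here are assumptions rather than documented behaviour: the function and parameter names, the `FALLBACK_ORDER` used for "the next tier", and the use of `None` to mean "rejected". The choice between batch and express nodes is omitted, and the kill-switch check is applied once the candidate tier is known.

```python
from typing import Optional

# Assumed order for "the next tier"; the source does not spell it out.
FALLBACK_ORDER = ["A", "B", "C"]


def policy_tier(policy: str, complexity: str) -> str:
    """Step 3: pick a tier from the routing policy when no label is given."""
    if policy == "local-only":
        return "A"
    if policy in ("drain-batch", "drain-express"):
        return "B"  # which Tier B node is used is left out of this sketch
    if policy == "balanced":
        return "B" if complexity == "high" else "A"
    raise ValueError(f"unknown policy: {policy}")


def route(complexity: str, policy: str = "balanced", label: Optional[str] = None,
          killed=frozenset(), unavailable=frozenset(),
          affinity: bool = False, reject_on_kill: bool = True) -> Optional[str]:
    """Return the tier that should serve the request, or None if it is rejected."""
    # Steps 2 and 3: an explicit label wins, otherwise the policy decides.
    tier = label if label is not None else policy_tier(policy, complexity)
    candidates = [tier] + [t for t in FALLBACK_ORDER if t != tier]
    for i, cand in enumerate(candidates):
        blocked_by_kill = cand in killed
        # Step 1: a closed kill switch on the requested tier rejects or falls through.
        if blocked_by_kill and i == 0 and reject_on_kill:
            return None
        # Step 4: fall to the next tier unless tier affinity forbids it.
        if blocked_by_kill or cand in unavailable:
            if affinity:
                return None
            continue
        return cand
    return None
```

For example, `route("high", unavailable={"B"})` falls back to Tier A, while the same call with `affinity=True` returns `None`, matching the structured-extraction case in step 4.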

The routing policy is configurable at runtime without restarting the gateway:

```
POST /v1/flow/policy { "policy": "balanced" }
```

## The kill switch

Every tier has an independent kill switch. Closing a kill switch stops all new
dispatching to that tier immediately. In-flight requests complete; no new requests
start. Queued work accumulates and drains when the kill switch is reopened.

The kill switch is the operator's billing control. Closing the express node switch
stops the express node's A100 from starting, so its cost drops to zero. Closing the
global switch stops all Tier B and Tier C spending while allowing Tier A to continue
serving.

The express lane, which bypasses the file-backed queue for time-sensitive work,
still checks the kill switch. Nothing bypasses the kill switch.
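
A rough sketch of these semantics, with hypothetical target names (`local`, `batch`, `express`, `external`) and a hypothetical `KillSwitch` class. It assumes every dispatch path, the express lane included, goes through `dispatch()`, and that blocked work is held in memory; the source describes a file-backed queue and does not say what happens to express requests while their switch is closed.

```python
from collections import deque

# Hypothetical names for the Tier B nodes and Tier C; Tier A ("local") is not paid.
PAID_TARGETS = {"batch", "express", "external"}


class KillSwitch:
    def __init__(self):
        self.closed = set()         # targets whose individual switch is closed
        self.global_closed = False  # global switch: blocks paid targets only
        self.held = deque()         # work accumulated while a switch is closed

    def allows(self, target: str) -> bool:
        if target in self.closed:
            return False
        return not (self.global_closed and target in PAID_TARGETS)

    def dispatch(self, target: str, item, send) -> bool:
        """Send now if the switch allows it, otherwise hold the item."""
        if self.allows(target):
            send(target, item)
            return True
        self.held.append((target, item))
        return False

    def reopen(self, target: str, send) -> None:
        """Reopen one switch and drain the work it was holding back."""
        self.closed.discard(target)
        still_held = deque()
        while self.held:
            tgt, item = self.held.popleft()
            if self.allows(tgt):
                send(tgt, item)
            else:
                still_held.append((tgt, item))
        self.held = still_held
```

Keeping the check in one place is what makes "nothing bypasses the kill switch" easy to reason about: a lane can skip the queue, but it cannot skip `allows()`.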

## The priority queue

Background work (apprenticeship briefs, corpus extraction, training corpus generation)
is processed through a file-backed priority queue with three levels:

- **P0** routes exclusively to the local model for lightweight classification.
- **P1** routes to the batch GPU node for extraction work requiring structured output.
- **P2** routes to the batch GPU node for training corpus generation and similar
  long-running background tasks.

The queue drain worker processes one item from each level per cycle, in P0 → P1 → P2
order, then repeats. This prevents a large batch of P2 work from blocking P1
extraction tasks for an extended period.
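
The drain order can be illustrated with a small sketch. In-memory deques stand in for the file-backed queue, and the function names are hypothetical.

```python
from collections import deque

LEVELS = ("P0", "P1", "P2")


def drain_cycle(queues, handle) -> int:
    """Process at most one item from each level, in P0 -> P1 -> P2 order."""
    processed = 0
    for level in LEVELS:
        if queues[level]:
            handle(level, queues[level].popleft())
            processed += 1
    return processed


def drain(queues, handle) -> None:
    """Repeat cycles until a full cycle finds every level empty."""
    while drain_cycle(queues, handle):
        pass


queues = {
    "P0": deque(["classify-1"]),
    "P1": deque(["extract-1", "extract-2"]),
    "P2": deque(["train-1", "train-2", "train-3"]),
}
order = []
drain(queues, lambda level, item: order.append(item))
# order is now:
# ["classify-1", "extract-1", "train-1", "extract-2", "train-2", "train-3"]
```

Because each cycle takes one item per level, `extract-2` is handled before `train-2` and `train-3` even though the P2 backlog is larger.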

## Organizational memory context

Before dispatching any request to any tier, the gateway queries the organizational
knowledge graph for entities relevant to the current request. Matching entities are
injected into the system prompt as structured context. The model sees the
organization's known relationships, decisions, and policies without having to
re-derive those facts through inference.

This context injection is non-fatal: if the graph service is unavailable, the request
proceeds without context. A circuit breaker on the graph query path prevents a
slow graph service from blocking inference.
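
A minimal sketch of non-fatal context injection, under several assumptions: `query_graph` is a hypothetical callable that enforces its own timeout and raises on error, the injected context is serialized as JSON, and the `GraphBreaker` shown has no recovery path (a real breaker would need one, for example a periodic probe).

```python
import json


class GraphBreaker:
    """Opens after `threshold` consecutive graph query failures (assumed parameter)."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.threshold

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1


def build_system_prompt(base_prompt, request_text, query_graph, breaker):
    """Return the system prompt, with graph context appended when available."""
    entities = []
    if not breaker.is_open:  # an open breaker skips the graph call entirely
        try:
            entities = query_graph(request_text)  # assumed to raise on error or timeout
            breaker.record(True)
        except Exception:
            breaker.record(False)  # non-fatal: continue without context
    if not entities:
        return base_prompt
    context = json.dumps(entities, sort_keys=True)
    return f"{base_prompt}\n\nKnown organizational context:\n{context}"
```

Once the breaker is open, requests stop paying the cost of a failing or slow graph call and go straight to inference with the unmodified prompt.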

## The MCP server

The gateway exposes an organizational memory interface via the Model Context Protocol
on a second port. Any MCP-capable AI client can connect to this interface using its
built-in subscription; no separate API key is required. The client's reasoning
capability combines with the gateway's organizational knowledge graph to produce
responses that are grounded in the organization's actual data.

This is the primary path for interactive use by operators who already have a
subscription to an MCP-capable client. The gateway handles memory; the client
handles reasoning.