Diff: how-to/run-local-slm-inference
From 1c02ec1 to 1c02ec1
+0 / −0 lines
| Before | After |
|---|---|
| --- | --- |
| schema: foundry-doc-v1 | schema: foundry-doc-v1 |
| title: "How to run local SLM inference" | title: "How to run local SLM inference" |
| slug: run-local-slm-inference | slug: run-local-slm-inference |
| category: how-to | category: how-to |
| content_type: how-to | content_type: how-to |
| type: how-to | type: how-to |
| status: active | status: active |
| last_edited: 2026-06-14 | last_edited: 2026-06-14 |
| editor: pointsav-engineering | editor: pointsav-engineering |
| paired_with: run-local-slm-inference.es.md | paired_with: run-local-slm-inference.es.md |
| --- | --- |
| The PointSav inference stack runs a small language model locally via the Doorman gateway. All inference stays on the operator's hardware — no prompt data leaves the deployment. This guide covers starting the local SLM service, verifying the Doorman health endpoint, and submitting an inference request — both from the console TUI and directly via the API. | The PointSav inference stack runs a small language model locally via the Doorman gateway. All inference stays on the operator's hardware — no prompt data leaves the deployment. This guide covers starting the local SLM service, verifying the Doorman health endpoint, and submitting an inference request — both from the console TUI and directly via the API. |
| For the inference stack architecture, see [[slm-stack-architecture]] and [[doorman-protocol]]. For the console cartridge that surfaces local inference in the TUI, see [[app-console-slm]]. | For the inference stack architecture, see [[slm-stack-architecture]] and [[doorman-protocol]]. For the console cartridge that surfaces local inference in the TUI, see [[app-console-slm]]. |
| ## Prerequisites | ## Prerequisites |
| - A deployment with the OLMo model binary installed at the expected path (see [[self-host-a-deployment]]) | - A deployment with the OLMo model binary installed at the expected path (see [[self-host-a-deployment]]) |
| - The `slm-doorman-server` service running and healthy | - The `slm-doorman-server` service running and healthy |
| - A session with at least USER-level access (see [[pair-a-new-device]]) | - A session with at least USER-level access (see [[pair-a-new-device]]) |
| ## Step 1: Start the SLM service | ## Step 1: Start the SLM service |
| If the SLM service is not already running, start it: | If the SLM service is not already running, start it: |
| ``` | ``` |
| sudo systemctl start slm-doorman-server | sudo systemctl start slm-doorman-server |
| ``` | ``` |
| Verify it started cleanly: | Verify it started cleanly: |
| ``` | ``` |
| systemctl is-active slm-doorman-server | systemctl is-active slm-doorman-server |
| journalctl -u slm-doorman-server --since "1 minute ago" | journalctl -u slm-doorman-server --since "1 minute ago" |
| ``` | ``` |
| A healthy start produces a log line indicating the model loaded and the Doorman is listening on its configured port. If the service fails to start, check the model binary path in the service configuration — the OLMo binary must be present at the path the service expects. | A healthy start produces a log line indicating the model loaded and the Doorman is listening on its configured port. If the service fails to start, check the model binary path in the service configuration — the OLMo binary must be present at the path the service expects. |
| ## Step 2: Verify Doorman health from the console | ## Step 2: Verify Doorman health from the console |
| Press **F9** in the console to open the SLM Cartridge. The Doorman health dashboard shows: | Press **F9** in the console to open the SLM Cartridge. The Doorman health dashboard shows: |
| - `A — DataGraph`: availability of the entity store (not required for pure inference) | - `A — DataGraph`: availability of the entity store (not required for pure inference) |
| - `B — SLM`: should show green once the model is loaded and Doorman is reachable | - `B — SLM`: should show green once the model is loaded and Doorman is reachable |
| - `C — Local fallback`: always available; used when Tier B is degraded | - `C — Local fallback`: always available; used when Tier B is degraded |
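| Tier B must be green for requests to be served by the local SLM; while Tier B is degraded, requests fall through to Tier C. Press **R** to refresh the health status. | Tier B must be green for requests to be served by the local SLM; while Tier B is degraded, requests fall through to Tier C. Press **R** to refresh the health status. |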
| ## Step 3: Submit an inference request from the console | ## Step 3: Submit an inference request from the console |
| With Tier B live, submit a prompt at the F9 input line. Type your prompt text and press Enter. The model response streams token-by-token into the output area. The status bar shows the active inference tier (`B`) during generation. | With Tier B live, submit a prompt at the F9 input line. Type your prompt text and press Enter. The model response streams token-by-token into the output area. The status bar shows the active inference tier (`B`) during generation. |
| Inference requests through the console are SYS-ADR-07-safe — no structured platform data passes through the model layer. The model receives plain prompt text only. | Inference requests through the console are SYS-ADR-07-safe — no structured platform data passes through the model layer. The model receives plain prompt text only. |
| ## Step 4: Submit an inference request directly via API | ## Step 4: Submit an inference request directly via API |
| For programmatic use, call the Doorman inference endpoint: | For programmatic use, call the Doorman inference endpoint: |
| ``` | ``` |
| curl -X POST http://127.0.0.1:<doorman-port>/v1/completions \ | curl -X POST http://127.0.0.1:<doorman-port>/v1/completions \ |
| -H "Content-Type: application/json" \ | -H "Content-Type: application/json" \ |
| -H "Authorization: Bearer <session-token>" \ | -H "Authorization: Bearer <session-token>" \ |
| -d '{ | -d '{ |
| "prompt": "Summarise the role of the Doorman gateway:", | "prompt": "Summarise the role of the Doorman gateway:", |
| "max_tokens": 200 | "max_tokens": 200 |
| }' | }' |
| ``` | ``` |
| The response is a JSON object with a `choices` array. Each choice contains the generated text. The `model` field in the response confirms which tier served the request. | The response is a JSON object with a `choices` array. Each choice contains the generated text. The `model` field in the response confirms which tier served the request. |
| ## Step 5: Check the circuit breaker state | ## Step 5: Check the circuit breaker state |
| The Doorman circuit breaker opens automatically if the SLM service becomes unresponsive. When open, all inference requests fall through to Tier C (local fallback). To check the circuit state: | The Doorman circuit breaker opens automatically if the SLM service becomes unresponsive. When open, all inference requests fall through to Tier C (local fallback). To check the circuit state: |
| ``` | ``` |
| curl http://127.0.0.1:<doorman-port>/health | curl http://127.0.0.1:<doorman-port>/health |
| ``` | ``` |
| The response includes `tier_b_state`: `CLOSED` (healthy) or `OPEN` (tripped). A tripped circuit resets after the configured cool-down period, or immediately after the SLM service recovers. | The response includes `tier_b_state`: `CLOSED` (healthy) or `OPEN` (tripped). A tripped circuit resets after the configured cool-down period, or immediately after the SLM service recovers. |
| ## Key takeaways | ## Key takeaways |
| - All inference runs on-premises; no prompt data leaves the deployment | - All inference runs on-premises; no prompt data leaves the deployment |
| - Tier B (SLM) must show green in the F9 health dashboard before requests are served by the local SLM | - Tier B (SLM) must show green in the F9 health dashboard before requests are served by the local SLM |
| - The Doorman circuit breaker falls back to Tier C automatically when the model is unresponsive | - The Doorman circuit breaker falls back to Tier C automatically when the model is unresponsive |
| - SYS-ADR-07 applies: do not pass structured platform data (entity records, WORM entries) through the model layer | - SYS-ADR-07 applies: do not pass structured platform data (entity records, WORM entries) through the model layer |
| ## See also | ## See also |
| - [[slm-stack-architecture]] — architecture of the local SLM stack and supported model tiers | - [[slm-stack-architecture]] — architecture of the local SLM stack and supported model tiers |
| - [[doorman-protocol]] — the Doorman gateway protocol; health, routing, and circuit-breaker behaviour | - [[doorman-protocol]] — the Doorman gateway protocol; health, routing, and circuit-breaker behaviour |
| - [[app-console-slm]] — the os-console SLM cartridge and the Doorman health dashboard | - [[app-console-slm]] — the os-console SLM cartridge and the Doorman health dashboard |
| - [[run-first-slm-query]] — submitting a query from the console once the model is running | - [[run-first-slm-query]] — submitting a query from the console once the model is running |
| - [[self-host-a-deployment]] — provision the instance that hosts the inference stack | - [[self-host-a-deployment]] — provision the instance that hosts the inference stack |
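
The circuit-breaker check in Step 5 can be scripted. The following is a minimal sketch, not the project's own tooling. It assumes the `/health` response is JSON with `tier_b_state` as a top-level string field, and it reads the port from `DOORMAN_PORT`, a hypothetical environment variable standing in for `<doorman-port>`. The function names are illustrative only.

```shell
#!/usr/bin/env bash
# Sketch only: helpers for reading tier_b_state from a Doorman /health body.
# Assumption: the body is JSON with a top-level string field "tier_b_state".

# Read a JSON body on stdin and print the value of tier_b_state (or nothing).
tier_b_state() {
  sed -n 's/.*"tier_b_state"[[:space:]]*:[[:space:]]*"\([A-Za-z_]*\)".*/\1/p' | head -n 1
}

# Read a JSON body on stdin; succeed only if the circuit is CLOSED (healthy).
tier_b_healthy() {
  [ "$(tier_b_state)" = "CLOSED" ]
}

# Poll /health until Tier B reports CLOSED or the attempts run out.
# Usage: wait_for_tier_b [attempts] [delay_seconds]
# DOORMAN_PORT is a hypothetical variable standing in for <doorman-port>.
wait_for_tier_b() {
  local attempts="${1:-10}" delay="${2:-3}" i=1
  if [ -z "${DOORMAN_PORT:-}" ]; then
    echo "DOORMAN_PORT is not set" >&2
    return 2
  fi
  while [ "$i" -le "$attempts" ]; do
    # curl -f fails on HTTP errors; an empty body is treated as not healthy.
    if curl -fsS "http://127.0.0.1:${DOORMAN_PORT}/health" 2>/dev/null | tier_b_healthy; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

After sourcing the file and exporting `DOORMAN_PORT`, `wait_for_tier_b 10 3` returns 0 once Tier B reports `CLOSED` and 1 if it is still not healthy after ten attempts. The `sed` extraction keeps the sketch dependency-free; a real JSON parser is the safer choice if the response format is more complex than assumed here.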