Diff: services/service-slm-yoyo-operational
From 9106fc7 to 9106fc7
+0 / −0 lines
| Before | After |
|---|---|
| --- | --- |
| schema: foundry-doc-v1 | schema: foundry-doc-v1 |
| title: "SLM and Yo-Yo operational state" | title: "SLM and Yo-Yo operational state" |
| slug: service-slm-yoyo-operational | slug: service-slm-yoyo-operational |
| category: services | category: services |
| type: topic | type: topic |
| quality: complete | quality: complete |
| short_description: "How service-SLM's three-tier inference router and the Yo-Yo GPU burst VM operate, including the Doorman boundary, Tier A/B configuration, apprenticeship brief queue, and idle-shutdown cost ceiling." | short_description: "How service-SLM's three-tier inference router and the Yo-Yo GPU burst VM operate, including the Doorman boundary, Tier A/B configuration, apprenticeship brief queue, and idle-shutdown cost ceiling." |
| status: active | status: active |
| bcsc_class: public-disclosure-safe | bcsc_class: public-disclosure-safe |
| last_edited: 2026-05-25 | last_edited: 2026-05-25 |
| editor: pointsav-engineering | editor: pointsav-engineering |
| cites: | cites: |
| - ni-51-102 | - ni-51-102 |
| - osc-sn-51-721 | - osc-sn-51-721 |
| - olmo3-allenai | - olmo3-allenai |
| paired_with: service-slm-yoyo-operational.es.md | paired_with: service-slm-yoyo-operational.es.md |
| --- | --- |
| **service-SLM** is the platform's Ring 3 component — the Optional Intelligence layer. It is a three-tier inference router that clusters and contributors use to delegate routine work: editorial polish, mechanical schema-conforming edits, bilingual translation drafts, and structured-output generation. The work is handled locally or on a dedicated GPU burst VM, without routing to a third-party API. Rings 1 and 2 (boundary ingest and knowledge processing) function fully without it; Ring 3 is structurally optional. | **service-SLM** is the platform's Ring 3 component — the Optional Intelligence layer. It is a three-tier inference router that clusters and contributors use to delegate routine work: editorial polish, mechanical schema-conforming edits, bilingual translation drafts, and structured-output generation. The work is handled locally or on a dedicated GPU burst VM, without routing to a third-party API. Rings 1 and 2 (boundary ingest and knowledge processing) function fully without it; Ring 3 is structurally optional. |
| The **Yo-Yo** is the name for the platform's on-demand GPU burst instance — a GCE VM that runs a 32-billion-parameter instruction-tuned model at approximately 50-100 tokens per second. It starts on demand, shuts down after 30 minutes of inactivity, and accumulates a brief queue through its idle windows. The combination — a lightweight always-available local model on the workspace VM and a capable on-demand burst VM — defines the two active inference tiers. A third tier (external API) is configured for future use; Tier C has no active keys in the current operational period. | The **Yo-Yo** is the name for the platform's on-demand GPU burst instance — a GCE VM that runs a 32-billion-parameter instruction-tuned model at approximately 50-100 tokens per second. It starts on demand, shuts down after 30 minutes of inactivity, and accumulates a brief queue through its idle windows. The combination — a lightweight always-available local model on the workspace VM and a capable on-demand burst VM — defines the two active inference tiers. A third tier (external API) is configured for future use; Tier C has no active keys in the current operational period. |
| This document describes how service-SLM and the Yo-Yo operate in the current operational period, during which the inference-substrate design was marked complete. | This document describes how service-SLM and the Yo-Yo operate in the current operational period, during which the inference-substrate design was marked complete. |
| ## The Doorman boundary | ## The Doorman boundary |
| Every inference request crosses the Doorman before reaching a model tier. The Doorman is a Rust binary running as systemd unit `local-doorman.service`, binding `127.0.0.1:9080`. Its responsibilities cover the full request lifecycle: | Every inference request crosses the Doorman before reaching a model tier. The Doorman is a Rust binary running as systemd unit `local-doorman.service`, binding `127.0.0.1:9080`. Its responsibilities cover the full request lifecycle: |
| - Hold all API keys — Tier C provider tokens and the Tier B bearer token. Keys exist nowhere else in the request path. This is the API-key boundary discipline: no key dispersal across call sites. | - Hold all API keys — Tier C provider tokens and the Tier B bearer token. Keys exist nowhere else in the request path. This is the API-key boundary discipline: no key dispersal across call sites. |
| - Route requests to the correct tier based on complexity heuristics: request size, structured-output requirements, and audit-ledger semantics. | - Route requests to the correct tier based on complexity heuristics: request size, structured-output requirements, and audit-ledger semantics. |
| - Sanitise outbound requests before they reach any external API (strip workspace identifiers; rehydrate on inbound). | - Sanitise outbound requests before they reach any external API (strip workspace identifiers; rehydrate on inbound). |
| - Append every transit to a per-tenant audit ledger at `/var/lib/local-doorman/audit/<tenant>/<YYYY-MM>.jsonl`. | - Append every transit to a per-tenant audit ledger at `/var/lib/local-doorman/audit/<tenant>/<YYYY-MM>.jsonl`. |
| - Drain the apprenticeship brief queue (described below). | - Drain the apprenticeship brief queue (described below). |
| The `/readyz` endpoint returns live tier-availability flags. An example response when all tiers are operational: | The `/readyz` endpoint returns live tier-availability flags. An example response when all tiers are operational: |
| ```json | ```json |
| { | { |
| "ready": true, | "ready": true, |
| "has_local": true, | "has_local": true, |
| "has_yoyo": true, | "has_yoyo": true, |
| "has_external": false, | "has_external": false, |
| "apprenticeship_enabled": true | "apprenticeship_enabled": true |
| } | } |
| ``` | ``` |
| ## Tier A — workspace VM (always available) | ## Tier A — workspace VM (always available) |
| Tier A runs `llama-server` — the C++ HTTP server from llama.cpp — on the workspace VM CPU. The model is `OLMo-3-1125-7B-Think-Q4_K_M.gguf`, self-quantized from AllenAI's published safetensors release (Apache 2.0; sovereign supply chain). It binds `127.0.0.1:8080` as `local-slm.service`. | Tier A runs `llama-server` — the C++ HTTP server from llama.cpp — on the workspace VM CPU. The model is `OLMo-3-1125-7B-Think-Q4_K_M.gguf`, self-quantized from AllenAI's published safetensors release (Apache 2.0; sovereign supply chain). It binds `127.0.0.1:8080` as `local-slm.service`. |
| Throughput on the workspace VM (an `e2-standard-4` GCE instance, CPU only) is approximately 2-3 tokens per second. This is sufficient for short briefs and trivial completions. It is not sufficient for routine editorial work at scale. The Tier A latency constraint was documented operationally and motivated the ratification of the four-tier substrate ladder. | Throughput on the workspace VM (an `e2-standard-4` GCE instance, CPU only) is approximately 2-3 tokens per second. This is sufficient for short briefs and trivial completions. It is not sufficient for routine editorial work at scale. The Tier A latency constraint was documented operationally and motivated the ratification of the four-tier substrate ladder. |
| ## Tier B — Yo-Yo on L4 GPU | ## Tier B — Yo-Yo on L4 GPU |
| Tier B runs `llama-server` with CUDA support on a separate GCE instance (`yoyo-tier-b-1`) in `us-west1-a`. The hardware is `g2-standard-4`: 4 vCPU, 16 GB RAM, and one NVIDIA L4 GPU with 24 GB VRAM. The model is AllenAI's published `OLMo-2-0325-32B-Instruct-Q4_K_S.gguf` (Apache 2.0). The instance is provisioned on-demand rather than as a spot instance — L4 spot capacity proved unreliable across multiple US zones during initial bootstrapping. | Tier B runs `llama-server` with CUDA support on a separate GCE instance (`yoyo-tier-b-1`) in `us-west1-a`. The hardware is `g2-standard-4`: 4 vCPU, 16 GB RAM, and one NVIDIA L4 GPU with 24 GB VRAM. The model is AllenAI's published `OLMo-2-0325-32B-Instruct-Q4_K_S.gguf` (Apache 2.0). The instance is provisioned on-demand rather than as a spot instance — L4 spot capacity proved unreliable across multiple US zones during initial bootstrapping. |
| Port 8080 on the Yo-Yo VM is restricted by GCE firewall rule `yoyo-tier-b-from-workspace` to the workspace VM's internal IP (`10.138.0.4/32`) only. The Doorman holds the bearer token (`SLM_YOYO_BEARER`, configured in `/etc/local-doorman/local-doorman.env`) and authenticates every request. | Port 8080 on the Yo-Yo VM is restricted by GCE firewall rule `yoyo-tier-b-from-workspace` to the workspace VM's internal IP (`10.138.0.4/32`) only. The Doorman holds the bearer token (`SLM_YOYO_BEARER`, configured in `/etc/local-doorman/local-doorman.env`) and authenticates every request. |
| Measured throughput (initial smoke test): approximately 50-100 tokens per second for generation and approximately 100 tokens per second for prompt processing. A typical 500-token instruction task completes in 5-15 seconds wall-clock. Cold-start, which loads the model into GPU memory, takes 60-180 seconds and is amortised across subsequent requests in the same session. | Measured throughput (initial smoke test): approximately 50-100 tokens per second for generation and approximately 100 tokens per second for prompt processing. A typical 500-token instruction task completes in 5-15 seconds wall-clock. Cold-start, which loads the model into GPU memory, takes 60-180 seconds and is amortised across subsequent requests in the same session. |
| ### Provisioning | ### Provisioning |
| The Yo-Yo VM is configured from the startup provisioning script at `infrastructure/yoyo-manual/startup.sh`. A fresh provision reproduces the live state in approximately 30-40 minutes wall-time, across eight steps: | The Yo-Yo VM is configured from the startup provisioning script at `infrastructure/yoyo-manual/startup.sh`. A fresh provision reproduces the live state in approximately 30-40 minutes wall-time, across eight steps: |
| 1. Wait for cloud-init and unattended-upgrades to release the dpkg lock (up to 5 minutes on first boot). | 1. Wait for cloud-init and unattended-upgrades to release the dpkg lock (up to 5 minutes on first boot). |
| 2. Install common dependencies (`curl`, `wget`, `jq`, `aria2`, `python3.12-venv`). | 2. Install common dependencies (`curl`, `wget`, `jq`, `aria2`, `python3.12-venv`). |
| 3. Install CUDA toolkit, cmake, and build-essential. | 3. Install CUDA toolkit, cmake, and build-essential. |
| 4. Clone llama.cpp and build `llama-server` with `-DGGML_CUDA=ON` and `-j 2`. The `-j 2` constraint is intentional: unrestricted parallelism triggers `cc1plus` out-of-memory failures on 16 GB RAM during compilation. | 4. Clone llama.cpp and build `llama-server` with `-DGGML_CUDA=ON` and `-j 2`. The `-j 2` constraint is intentional: unrestricted parallelism triggers `cc1plus` out-of-memory failures on 16 GB RAM during compilation. |
| 5. Download the OLMo 2 GGUF via `aria2c` with four parallel segments and resume support. Single-stream wget proved unreliable against HuggingFace unauthenticated CDN rate-limiting; `aria2c -x 4 -s 4` is the documented 2026 community practice. | 5. Download the OLMo 2 GGUF via `aria2c` with four parallel segments and resume support. Single-stream wget proved unreliable against HuggingFace unauthenticated CDN rate-limiting; `aria2c -x 4 -s 4` is the documented 2026 community practice. |
| 6. Generate a 64-character bearer token and write it to `/etc/yoyo-bearer` with mode 0640. | 6. Generate a 64-character bearer token and write it to `/etc/yoyo-bearer` with mode 0640. |
| 7. Configure the `yoyo-llama-server.service` systemd unit. | 7. Configure the `yoyo-llama-server.service` systemd unit. |
| 8. Start the service and wait for `/health` to return `{"status":"ok"}`. | 8. Start the service and wait for `/health` to return `{"status":"ok"}`. |
| The bootstrap pipeline incorporates 14 distinct fixes for failure modes encountered during initial iteration: spot capacity stockouts, networking edge cases, NVIDIA driver and kernel version mismatches (driver 550 does not support kernel 6.17; the solution is a DL VM image with NVIDIA 580 and a matched kernel), 16 GB RAM compilation OOM, dpkg lock races, model hosting changes, HuggingFace CDN rate-limiting, and llama-server architecture support gaps. Each fix is documented inline in `infrastructure/yoyo-manual/startup.sh`. The iteration history is preserved in the workspace CHANGELOG. | The bootstrap pipeline incorporates 14 distinct fixes for failure modes encountered during initial iteration: spot capacity stockouts, networking edge cases, NVIDIA driver and kernel version mismatches (driver 550 does not support kernel 6.17; the solution is a DL VM image with NVIDIA 580 and a matched kernel), 16 GB RAM compilation OOM, dpkg lock races, model hosting changes, HuggingFace CDN rate-limiting, and llama-server architecture support gaps. Each fix is documented inline in `infrastructure/yoyo-manual/startup.sh`. The iteration history is preserved in the workspace CHANGELOG. |
| ## The apprenticeship brief queue | ## The apprenticeship brief queue |
| Every commit on the platform triggers the post-commit capture hook. The hook writes two records: | Every commit on the platform triggers the post-commit capture hook. The hook writes two records: |
| - An engineering corpus tuple at `data/training-corpus/engineering/<scope>/<commit_sha>.jsonl` (accumulating continuously). | - An engineering corpus tuple at `data/training-corpus/engineering/<scope>/<commit_sha>.jsonl` (accumulating continuously). |
| - A shadow brief at `data/apprenticeship/queue/<brief_id>.brief.jsonl` (replaces an earlier HTTP fire-and-forget path that proved unreliable under network interruptions). | - A shadow brief at `data/apprenticeship/queue/<brief_id>.brief.jsonl` (replaces an earlier HTTP fire-and-forget path that proved unreliable under network interruptions). |
| The Doorman runs a drain worker that polls the queue directory every 30 seconds. When a brief appears, the worker performs an atomic rename to `queue-in-flight/` (dequeue), dispatches to the apprentice tier — Tier A by default, Tier B when the brief size exceeds `SLM_BRIEF_TIER_B_THRESHOLD_CHARS=500` — and on completion writes a corpus tuple at `data/training-corpus/apprenticeship/<task-type>/<tenant>/<brief_id>.jsonl` at stage `review`. A reaper task reclaims expired leases (5-minute timeout) and returns briefs to the queue for retry. | The Doorman runs a drain worker that polls the queue directory every 30 seconds. When a brief appears, the worker performs an atomic rename to `queue-in-flight/` (dequeue), dispatches to the apprentice tier — Tier A by default, Tier B when the brief size exceeds `SLM_BRIEF_TIER_B_THRESHOLD_CHARS=500` — and on completion writes a corpus tuple at `data/training-corpus/apprenticeship/<task-type>/<tenant>/<brief_id>.jsonl` at stage `review`. A reaper task reclaims expired leases (5-minute timeout) and returns briefs to the queue for retry. |
| This mechanism is durable across Yo-Yo idle-shutdown windows, Doorman restarts, and apprentice timeouts. The queue accumulates while the Yo-Yo VM is stopped; on restart, the drain worker processes the backlog without loss. | This mechanism is durable across Yo-Yo idle-shutdown windows, Doorman restarts, and apprentice timeouts. The queue accumulates while the Yo-Yo VM is stopped; on restart, the drain worker processes the backlog without loss. |
| ## Cost ceiling — the idle-shutdown monitor | ## Cost ceiling — the idle-shutdown monitor |
| The Yo-Yo VM costs approximately $0.71 USD per hour while running. Always-on operation is approximately $540 USD per month. The idle-shutdown monitor at `bin/yoyo-idle-monitor.sh`, scheduled by `yoyo-idle-monitor.timer` every five minutes, keeps the monthly cost within a practical ceiling. | The Yo-Yo VM costs approximately $0.71 USD per hour while running. Always-on operation is approximately $540 USD per month. The idle-shutdown monitor at `bin/yoyo-idle-monitor.sh`, scheduled by `yoyo-idle-monitor.timer` every five minutes, keeps the monthly cost within a practical ceiling. |
| The monitor polls the Yo-Yo VM's `/slots` endpoint for active inferences. After 30 consecutive minutes of zero active inference slots, the monitor calls `gcloud compute instances stop yoyo-tier-b-1`. The brief queue continues to accumulate during stopped windows; the next operator-triggered start drains the backlog. | The monitor polls the Yo-Yo VM's `/slots` endpoint for active inferences. After 30 consecutive minutes of zero active inference slots, the monitor calls `gcloud compute instances stop yoyo-tier-b-1`. The brief queue continues to accumulate during stopped windows; the next operator-triggered start drains the backlog. |
| The monitor runs from the workspace VM rather than the Yo-Yo VM. The workspace VM's Compute Engine service account holds `cloud-platform` scope; the Yo-Yo VM's default service account does not. Running the monitor workspace-side is the simpler design given this scope difference. | The monitor runs from the workspace VM rather than the Yo-Yo VM. The workspace VM's Compute Engine service account holds `cloud-platform` scope; the Yo-Yo VM's default service account does not. Running the monitor workspace-side is the simpler design given this scope difference. |
| At a typical development utilisation of approximately 25 percent, the idle-shutdown ceiling brings monthly spend to approximately $130-150 USD. Customers replicating this pattern operate under the same economics; the cost ceiling is a structural property of the on-demand provisioning model, not a vendor-specific optimisation. | At a typical development utilisation of approximately 25 percent, the idle-shutdown ceiling brings monthly spend to approximately $130-150 USD. Customers replicating this pattern operate under the same economics; the cost ceiling is a structural property of the on-demand provisioning model, not a vendor-specific optimisation. |
| ## What runs on Tier B today | ## What runs on Tier B today |
| The platform's engineering workflow routes routine work to Tier B: mechanical documentation updates, schema-conforming edits, pattern-based refactors, bilingual translation drafts, routine status reports, and boilerplate code. Architectural decisions, novel design, and cross-layer coordination route to the frontier-model tier. service-SLM is the multiplier for routine work; the frontier model is the engine for judgment. | The platform's engineering workflow routes routine work to Tier B: mechanical documentation updates, schema-conforming edits, pattern-based refactors, bilingual translation drafts, routine status reports, and boilerplate code. Architectural decisions, novel design, and cross-layer coordination route to the frontier-model tier. service-SLM is the multiplier for routine work; the frontier model is the engine for judgment. |
| ## What is next | ## What is next |
| *Forward-looking statement: the targets in this section are planned, not committed outcomes. Actual timelines depend on apprenticeship corpus growth rate, operator availability, and model performance characteristics. [ni-51-102] [osc-sn-51-721]* | *Forward-looking statement: the targets in this section are planned, not committed outcomes. Actual timelines depend on apprenticeship corpus growth rate, operator availability, and model performance characteristics. [ni-51-102] [osc-sn-51-721]* |
| The current plan is for the apprenticeship corpus to reach 100 verdict-signed tuples in the near term, subject to commit cadence and senior-verdict throughput. The intent is for a majority of routine platform work to route through service-SLM as the drain worker accumulates a sufficient backlog of reviewed corpus tuples. The first per-cluster LoRA training cycle is planned once the corpus threshold is met, targeting the densest existing editorial collection. | The current plan is for the apprenticeship corpus to reach 100 verdict-signed tuples in the near term, subject to commit cadence and senior-verdict throughput. The intent is for a majority of routine platform work to route through service-SLM as the drain worker accumulates a sufficient backlog of reviewed corpus tuples. The first per-cluster LoRA training cycle is planned once the corpus threshold is met, targeting the densest existing editorial collection. |
| When AllenAI publishes OLMo 3 32B Think or Instruct in a Q4 GGUF format, the Yo-Yo deployment is designed to swap to it via a single configuration line. Per-cluster LoRA adapters compose on top of whatever base model is current; the substrate is base-model-agnostic by design. | When AllenAI publishes OLMo 3 32B Think or Instruct in a Q4 GGUF format, the Yo-Yo deployment is designed to swap to it via a single configuration line. Per-cluster LoRA adapters compose on top of whatever base model is current; the substrate is base-model-agnostic by design. |
| ## See also | ## See also |
| - [[compounding-substrate]] — the five-property architectural pattern this implements | - [[compounding-substrate]] — the five-property architectural pattern this implements |
| - [[service-slm]] — service-SLM service overview | - [[service-slm]] — service-SLM service overview |
| - [[apprenticeship-substrate]] — how training signal accumulates from operational corpus tuples | - [[apprenticeship-substrate]] — how training signal accumulates from operational corpus tuples |
| - [[brief-queue-substrate]] — the durable queue that connects the apprenticeship brief queue to Tier A/B processing | - [[brief-queue-substrate]] — the durable queue that connects the apprenticeship brief queue to Tier A/B processing |
| - [[worm-ledger-architecture]] — the audit ledger that records every external call | - [[worm-ledger-architecture]] — the audit ledger that records every external call |
| ## References | ## References |
| 1. Optional Intelligence Layer — Ring 3 is structurally optional; Rings 1 and 2 function without it. | 1. Optional Intelligence Layer — Ring 3 is structurally optional; Rings 1 and 2 function without it. |
| 2. Four-Tier SLM Substrate Ladder — Tier 0 (none) / Tier 1 (local 7B) / Tier 2 (Yo-Yo 32B vendor-hosted) / Tier 3 (PointSav-LLM, planned). | 2. Four-Tier SLM Substrate Ladder — Tier 0 (none) / Tier 1 (local 7B) / Tier 2 (Yo-Yo 32B vendor-hosted) / Tier 3 (PointSav-LLM, planned). |
| 3. AllenAI OLMo 3 model family. Apache 2.0 (model weights); Open Data Commons (training data). [olmo3-allenai] https://huggingface.co/allenai | 3. AllenAI OLMo 3 model family. Apache 2.0 (model weights); Open Data Commons (training data). [olmo3-allenai] https://huggingface.co/allenai |
| 4. NI 51-102 Continuous Disclosure Obligations. British Columbia Securities Commission. [ni-51-102] https://www.bcsc.bc.ca/securities-law/law-and-policy/instruments-and-policies/5-ongoing-requirements-for-issuers-insiders/current/51-102 | 4. NI 51-102 Continuous Disclosure Obligations. British Columbia Securities Commission. [ni-51-102] https://www.bcsc.bc.ca/securities-law/law-and-policy/instruments-and-policies/5-ongoing-requirements-for-issuers-insiders/current/51-102 |
| 5. OSC Staff Notice 51-721 Forward-Looking Information Disclosure. Ontario Securities Commission. [osc-sn-51-721] https://www.osc.ca/en/securities-law/instruments-rules-policies/5/51-721/osc-staff-notice-51-721-forward-looking-information-disclosure | 5. OSC Staff Notice 51-721 Forward-Looking Information Disclosure. Ontario Securities Commission. [osc-sn-51-721] https://www.osc.ca/en/securities-law/instruments-rules-policies/5/51-721/osc-staff-notice-51-721-forward-looking-information-disclosure |
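
The dequeue and tier-selection behaviour described under "The apprenticeship brief queue" can be illustrated with a short sketch. The Doorman itself is a Rust binary and its source is not shown in this document, so the Python below is only an illustration under stated assumptions: the function names `pick_tier` and `claim_next_brief` are hypothetical, brief size is assumed to be measured as a character count of the brief file, and the directories are passed in rather than taken from the real `data/apprenticeship/` layout. The only value taken from the text is the 500-character threshold.

```python
import os
import tempfile
from pathlib import Path

# Threshold quoted in the document; everything else here is a hypothetical sketch.
SLM_BRIEF_TIER_B_THRESHOLD_CHARS = 500


def pick_tier(brief_text: str, threshold: int = SLM_BRIEF_TIER_B_THRESHOLD_CHARS) -> str:
    """Return "A" by default, "B" when the brief is longer than the threshold."""
    return "B" if len(brief_text) > threshold else "A"


def claim_next_brief(queue_dir: Path, in_flight_dir: Path):
    """Move one queued brief into the in-flight directory and return its new path.

    The rename is atomic when both directories are on the same filesystem,
    so two workers cannot both claim the same brief. Returns None when the
    queue holds no briefs.
    """
    in_flight_dir.mkdir(parents=True, exist_ok=True)
    for candidate in sorted(queue_dir.glob("*.brief.jsonl")):
        target = in_flight_dir / candidate.name
        try:
            os.rename(candidate, target)
        except FileNotFoundError:
            # Another worker renamed this brief first; try the next one.
            continue
        return target
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        queue = Path(root) / "queue"
        in_flight = Path(root) / "queue-in-flight"
        queue.mkdir()
        (queue / "demo-1.brief.jsonl").write_text("x" * 600)

        claimed = claim_next_brief(queue, in_flight)
        print(claimed.name, pick_tier(claimed.read_text()))
```

A real worker would wrap `claim_next_brief` in the 30-second polling loop and pair it with the lease reaper described in the text; both are omitted here to keep the sketch focused on the rename-based claim.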