SLM and Yo-Yo operational state
TopicFrom the PointSav Documentation
How service-SLM's three-tier inference router and the Yo-Yo GPU burst VM operate, including the Doorman boundary, Tier A/B configuration, apprenticeship brief queue, and idle-shutdown cost ceiling.
service-SLM is the platform's Ring 3 component β the Optional Intelligence layer. It is a three-tier inference router that clusters and contributors use to delegate routine work: editorial polish, mechanical schema-conforming edits, bilingual translation drafts, and structured-output generation. The work is handled locally or on a dedicated GPU burst VM, without routing to a third-party API. Rings 1 and 2 (boundary ingest and knowledge processing) function fully without it; Ring 3 is structurally optional.
The Yo-Yo is the name for the platform's on-demand GPU burst instance β a GCE VM that runs a 32-billion-parameter instruction-tuned model at approximately 50-100 tokens per second. It starts on demand, shuts down after 30 minutes of inactivity, and accumulates a brief queue through its idle windows. The combination β a lightweight always-available local model on the workspace VM and a capable on-demand burst VM β defines the two active inference tiers. A third tier (external API) is configured for future use; Tier C has no active keys in the current operational period.
This document describes how service-SLM and the Yo-Yo operate in the current operational period, when the inference-substrate design was marked complete.
[edit]The Doorman boundary
Every inference request crosses the Doorman before reaching a model tier. The Doorman is a Rust binary running as systemd unit local-doorman.service, binding 127.0.0.1:9080. Its responsibilities cover the full request lifecycle:
- Hold all API keys β Tier C provider tokens and the Tier B bearer token. Keys exist nowhere else in the request path. This is the API-key boundary discipline: no key dispersal across call sites.
- Route requests to the correct tier based on complexity heuristics: request size, structured-output requirements, and audit-ledger semantics.
- Sanitise outbound requests before they reach any external API (strip workspace identifiers; rehydrate on inbound).
- Append every transit to a per-tenant audit ledger at
/var/lib/local-doorman/audit/<tenant>/<YYYY-MM>.jsonl. - Drain the apprenticeship brief queue (described below).
The /readyz endpoint returns live tier-availability flags. An example response when all tiers are operational:
{
"ready": true,
"has_local": true,
"has_yoyo": true,
"has_external": false,
"apprenticeship_enabled": true
}
[edit]Tier A β workspace VM (always available)
Tier A runs llama-server β the C++ HTTP server from llama.cpp β on the workspace VM CPU. The model is OLMo-3-1125-7B-Think-Q4_K_M.gguf, self-quantized from AllenAI's published safetensors release (Apache 2.0; sovereign supply chain). It binds 127.0.0.1:8080 as local-slm.service.
Throughput on the workspace VM (an e2-standard-4 GCE instance, CPU only) is approximately 2-3 tokens per second. This is sufficient for short briefs and trivial completions. It is not sufficient for routine editorial work at scale. The Tier A latency constraint was documented operationally and motivated the ratification of the four-tier substrate ladder.
[edit]Tier B β Yo-Yo on L4 GPU
Tier B runs llama-server with CUDA support on a separate GCE instance (yoyo-tier-b-1) in us-west1-a. The hardware is g2-standard-4: 4 vCPU, 16 GB RAM, and one NVIDIA L4 GPU with 24 GB VRAM. The model is AllenAI's published OLMo-2-0325-32B-Instruct-Q4_K_S.gguf (Apache 2.0). The instance is provisioned on-demand rather than as a spot instance β L4 spot capacity proved unreliable across multiple US zones during initial bootstrapping.
Port 8080 on the Yo-Yo VM is restricted by GCE firewall rule yoyo-tier-b-from-workspace to the workspace VM's internal IP (10.138.0.4/32) only. The Doorman holds the bearer token (SLM_YOYO_BEARER, configured in /etc/local-doorman/local-doorman.env) and authenticates every request.
Measured throughput (initial smoke test): approximately 50-100 tokens per second generation; 100 tokens per second prompt processing. A typical 500-token instruction task completes in 5-15 seconds wall-clock. Cold-start β loading the model into GPU memory β takes 60-180 seconds and is amortised across subsequent requests in the same session.
[edit]Provisioning
The Yo-Yo VM is configured from the startup provisioning script at infrastructure/yoyo-manual/startup.sh. A fresh provision reproduces the live state in approximately 30-40 minutes wall-time, across eight steps:
- Wait for cloud-init and unattended-upgrades to release the dpkg lock (up to 5 minutes on first boot).
- Install common dependencies (
curl,wget,jq,aria2,python3.12-venv). - Install CUDA toolkit, cmake, and build-essential.
- Clone llama.cpp and build
llama-serverwith-DGGML_CUDA=ONand-j 2. The-j 2constraint is intentional: unrestricted parallelism triggerscc1plusout-of-memory failures on 16 GB RAM during compilation. - Download the OLMo 2 GGUF via
aria2cwith four parallel segments and resume support. Single-stream wget proved unreliable against HuggingFace unauthenticated CDN rate-limiting;aria2c -x 4 -s 4is the documented 2026 community practice. - Generate a 64-character bearer token and write it to
/etc/yoyo-bearerwith mode 0640. - Configure the
yoyo-llama-server.servicesystemd unit. - Start the service and wait for
/healthto return{"status":"ok"}.
The bootstrap pipeline incorporates 14 distinct fixes for failure modes encountered during initial iteration: spot capacity stockouts, networking edge cases, NVIDIA driver and kernel version mismatches (driver 550 does not support kernel 6.17; the solution is a DL VM image with NVIDIA 580 and a matched kernel), 16 GB RAM compilation OOM, dpkg lock races, model hosting changes, HuggingFace CDN rate-limiting, and llama-server architecture support gaps. Each fix is documented inline in infrastructure/yoyo-manual/startup.sh. The iteration history is preserved in the workspace CHANGELOG.
[edit]The apprenticeship brief queue
Every commit on the platform triggers the post-commit capture hook. The hook writes two records:
- An engineering corpus tuple at
data/training-corpus/engineering/<scope>/<commit_sha>.jsonl(accumulating continuously). - A shadow brief at
data/apprenticeship/queue/<brief_id>.brief.jsonl(replaces an earlier HTTP fire-and-forget path that proved unreliable under network interruptions).
The Doorman runs a drain worker that polls the queue directory every 30 seconds. When a brief appears, the worker performs an atomic rename to queue-in-flight/ (dequeue), dispatches to the apprentice tier β Tier A by default, Tier B when the brief size exceeds SLM_BRIEF_TIER_B_THRESHOLD_CHARS=500 β and on completion writes a corpus tuple at data/training-corpus/apprenticeship/<task-type>/<tenant>/<brief_id>.jsonl at stage review. A reaper task reclaims expired leases (5-minute timeout) and returns briefs to the queue for retry.
This mechanism is durable across Yo-Yo idle-shutdown windows, Doorman restarts, and apprentice timeouts. The queue accumulates while the Yo-Yo VM is stopped; on restart, the drain worker processes the backlog without loss.
[edit]Cost ceiling β the idle-shutdown monitor
The Yo-Yo VM costs approximately $0.71 USD per hour while running. Always-on operation is approximately $540 USD per month. The idle-shutdown monitor at bin/yoyo-idle-monitor.sh, scheduled by yoyo-idle-monitor.timer every five minutes, keeps the monthly cost within a practical ceiling.
The monitor polls the Yo-Yo VM's /slots endpoint for active inferences. After 30 consecutive minutes of zero active inference slots, the monitor calls gcloud compute instances stop yoyo-tier-b-1. The brief queue continues to accumulate during stopped windows; the next operator-triggered start drains the backlog.
The monitor runs from the workspace VM rather than the Yo-Yo VM. The workspace VM's Compute Engine service account holds cloud-platform scope; the Yo-Yo VM's default service account does not. Running the monitor workspace-side is the simpler design given this scope difference.
At a typical development utilisation of approximately 25 percent, the idle-shutdown ceiling brings monthly spend to approximately $130-150 USD. Customers replicating this pattern operate under the same economics; the cost ceiling is a structural property of the on-demand provisioning model, not a vendor-specific optimisation.
[edit]What runs on Tier B today
The platform's engineering workflow routes routine work to Tier B: mechanical documentation updates, schema-conforming edits, pattern-based refactors, bilingual translation drafts, routine status reports, and boilerplate code. Architectural decisions, novel design, and cross-layer coordination route to the frontier-model tier. service-SLM is the multiplier for routine work; the frontier model is the engine for judgment.
[edit]What is next
Forward-looking statement: the targets in this section are planned, not committed outcomes. Actual timelines depend on apprenticeship corpus growth rate, operator availability, and model performance characteristics. [ni-51-102] [osc-sn-51-721]
It is currently planned for the apprenticeship corpus to reach 100 verdict-signed tuples in the near term, subject to commit cadence and senior-verdict throughput. It is intended that a majority of routine platform work routes through service-SLM as the drain worker accumulates a sufficient backlog of reviewed corpus tuples. The first per-cluster LoRA training cycle is planned once the corpus threshold is met, targeting the densest existing editorial collection.
When AllenAI publishes OLMo 3 32B Think or Instruct in a Q4 GGUF format, the Yo-Yo deployment is designed to swap to it via a single configuration line. Per-cluster LoRA adapters compose on top of whatever base model is current; the substrate is base-model-agnostic by design.
[edit]See also
- compounding-substrate β the five-property architectural pattern this implements
- service-slm β service-SLM service overview
- apprenticeship-substrate β how training signal accumulates from operational corpus tuples
- brief-queue-substrate β the durable queue that connects the apprenticeship brief queue to Tier A/B processing
- worm-ledger-architecture β the audit ledger that records every external call
[edit]References
- Optional Intelligence Layer β Ring 3 is structurally optional; Rings 1 and 2 function without it.
- Four-Tier SLM Substrate Ladder β Tier 0 (none) / Tier 1 (local 7B) / Tier 2 (Yo-Yo 32B vendor-hosted) / Tier 3 (PointSav-LLM, planned).
- AllenAI OLMo 3 model family. Apache 2.0 (model weights); Open Data Commons (training data). [olmo3-allenai] https://huggingface.co/allenai
- NI 51-102 Continuous Disclosure Obligations. British Columbia Securities Commission. [ni-51-102] https://www.bcsc.bc.ca/securities-law/law-and-policy/instruments-and-policies/5-ongoing-requirements-for-issuers-insiders/current/51-102
- OSC Staff Notice 51-721 Forward-Looking Information Disclosure. Ontario Securities Commission. [osc-sn-51-721] https://www.osc.ca/en/securities-law/instruments-rules-policies/5/51-721/osc-staff-notice-51-721-forward-looking-information-disclosure