SLM Rust stack architecture
From the PointSav Documentation
The full Rust dependency graph and binary architecture for service-slm, the Doorman service that mediates every inference call in the PointSav platform.
`service-slm` is built as a single, statically-linked Rust binary. Every direct dependency in the stack is either pure Rust or Rust bindings to a permissively licensed native library. No copyleft licenses appear anywhere in the dependency graph, which means PointSav holds an unrestricted right to fork, modify, and redistribute the entire codebase. This property is called the "We Own It" criterion.
The choice of Rust is not a language preference. It is an engineering constraint imposed by the intended deployment target: ToteboxOS appliance hardware, where a CPython interpreter plus a large ML framework does not fit in the available memory envelope, and where cold-start predictability and the absence of a garbage collector are operational requirements rather than optional improvements.
Why L2 Rust, not L3 Rust
Three distinct levels of "Rust-ness" are commonly conflated:
| Level | Meaning | Achievable for service-slm? |
|---|---|---|
| L1: Source Rust | All code written by PointSav is Rust | Yes |
| L2: Direct-deps Rust | Every crate directly depended on is a Rust crate (may internally FFI to C/C++) | Yes |
| L3: Transitive Rust | Every line in the entire dependency tree, including GPU kernels, is Rust | No, and not the right goal |
L3 is unachievable for GPU inference and graph databases in 2026 because CUDA kernels and columnar storage engines have a twenty-year C++ inheritance. L2 is achievable and sufficient. The "We Own It" test is a license question, not a language question: MIT + Apache-2.0 = we own it.
The canonical stack
Inference layer
The inference runtime for service-slm is mistral.rs, a statically-linked Rust binary that ships with FlashAttention V2/V3, PagedAttention, prefix caching, and LoRA hot-swap per token. It exposes an OpenAI-compatible HTTP endpoint, which is the wire protocol the Doorman uses for Tier A (local) and Tier B (GPU burst) calls.
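To make the wire protocol concrete, the following is a minimal sketch of the request a client would send to an OpenAI-compatible chat endpoint, built with the standard library only. The function names, the model name and the host used in the usage note are illustrative assumptions, not taken from the service-slm source; a production client would more plausibly build the body with a JSON library and send it with reqwest.

```rust
/// Escapes a string for embedding inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \u00XX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds a chat-completions body with a single user message.
/// `model` is whatever name the serving runtime was started with (assumption).
pub fn chat_request_body(model: &str, user_prompt: &str) -> String {
    format!(
        "{{\"model\":\"{}\",\"messages\":[{{\"role\":\"user\",\"content\":\"{}\"}}]}}",
        json_escape(model),
        json_escape(user_prompt)
    )
}

/// Wraps the body in a raw HTTP/1.1 POST to the OpenAI-style
/// `/v1/chat/completions` path.
pub fn http_request(host: &str, body: &str) -> String {
    format!(
        "POST /v1/chat/completions HTTP/1.1\r\nHost: {}\r\nContent-Type: application/json\r\nContent-Length: {}\r\n\r\n{}",
        host,
        body.len(),
        body
    )
}
```

Calling `http_request("localhost:8080", &chat_request_body("olmo-3", "Hello"))` yields the bytes to write to the socket. Because Tier A and Tier B share this protocol, the same request shape can be sent to either tier and only the host changes.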
The foundation ML framework underneath mistral.rs is candle (Apache-2.0/MIT dual license, Hugging Face). If mistral.rs ever diverges from platform requirements, candle provides a clean rebuild path without re-architecting the stack.
The OLMo 3 model family is the production base model selection. OLMo 3 carries an Apache 2.0 code license and an Open Data Commons license for training data, making it the only major open-weight family whose entire lineage (weights, training data, and code) is permissively licensed end-to-end. This is the requirement for the apprenticeship-substrate training path, where PointSav exercises the right to run continued pretraining on customer-accumulated signal.
HTTP and async runtime
The Doorman's inbound HTTP surface is served by axum (MIT), with tower middleware for retries, timeouts, and backpressure, running on the tokio async runtime (MIT). Outbound HTTP calls, whether to Cloud Run GPU instances (Yo-Yo) or to the Tier C external API allowlist, use hyper and reqwest.
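The tier distinction and the Tier C allowlist can be sketched as a small type plus a policy check. This is an illustration under stated assumptions: the `Tier` enum, the `OutboundPolicy` type and the host names are hypothetical, and the article does not describe how the real allowlist is stored or matched.

```rust
use std::collections::HashSet;

/// The three inference tiers named in this article.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Tier {
    /// Tier A: local inference.
    Local,
    /// Tier B: GPU burst on a Cloud Run instance (Yo-Yo).
    GpuBurst,
    /// Tier C: an external API, identified here by its host name.
    External { host: String },
}

/// Hypothetical outbound policy: Tier A and Tier B are always permitted,
/// Tier C only when the host is on the allowlist.
pub struct OutboundPolicy {
    tier_c_allowlist: HashSet<String>,
}

impl OutboundPolicy {
    /// Builds a policy from allowlisted host names (stored lower-cased).
    pub fn new<I, S>(hosts: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        let tier_c_allowlist = hosts
            .into_iter()
            .map(|h| {
                let host: String = h.into();
                host.to_ascii_lowercase()
            })
            .collect();
        OutboundPolicy { tier_c_allowlist }
    }

    /// Returns true when a call on the given tier may leave the process.
    pub fn permits(&self, tier: &Tier) -> bool {
        match tier {
            Tier::Local | Tier::GpuBurst => true,
            Tier::External { host } => {
                self.tier_c_allowlist.contains(&host.to_ascii_lowercase())
            }
        }
    }
}
```

Checking the policy before any outbound request is built keeps the allowlist decision in one place, independent of which HTTP client performs the call.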
Storage and state
The audit ledger uses sqlx with an SQLite backend for local, append-only structured storage. The long-term knowledge graph, held by service-content, is stored in LadybugDB and addressed through its Rust bindings (MIT). Cloud object storage for model weights and LoRA adapter artefacts is abstracted through object_store (Apache-2.0).
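The append-only property of the audit ledger can be illustrated with an in-memory stand-in. This is a sketch, not the slm-ledger schema: the `LedgerEntry` fields and method names are assumptions, and the real store is SQLite accessed through sqlx rather than a `Vec`.

```rust
/// One audit record. Field names are illustrative only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LedgerEntry {
    pub seq: u64,
    pub module_id: String,
    pub event: String,
}

/// In-memory stand-in for an append-only store: entries can be added and
/// read, but the API offers no way to update or delete them.
#[derive(Default)]
pub struct Ledger {
    entries: Vec<LedgerEntry>,
}

impl Ledger {
    pub fn new() -> Self {
        Ledger::default()
    }

    /// Appends a record and returns its sequence number (starting at 1).
    pub fn append(&mut self, module_id: &str, event: &str) -> u64 {
        let seq = self.entries.len() as u64 + 1;
        self.entries.push(LedgerEntry {
            seq,
            module_id: module_id.to_string(),
            event: event.to_string(),
        });
        seq
    }

    /// Read-only view of every record in insertion order.
    pub fn entries(&self) -> &[LedgerEntry] {
        &self.entries
    }

    /// Exports the ledger as CSV with a header row.
    pub fn to_csv(&self) -> String {
        let mut out = String::from("seq,module_id,event\n");
        for e in &self.entries {
            out.push_str(&format!(
                "{},{},{}\n",
                e.seq,
                csv_field(&e.module_id),
                csv_field(&e.event)
            ));
        }
        out
    }
}

/// Quotes a field when it contains a comma, quote or newline.
fn csv_field(s: &str) -> String {
    if s.contains(',') || s.contains('"') || s.contains('\n') {
        format!("\"{}\"", s.replace('"', "\"\""))
    } else {
        s.to_string()
    }
}
```

Exposing only `append` and read accessors means the append-only rule is enforced by the type's interface rather than by convention.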
Document processing
The document ingest path uses oxidize-pdf for PDF parsing (pure Rust, zero C dependencies, 99.3% success rate across real-world PDFs at 3,000 to 4,000 pages per second), docx-rust for .docx files, calamine for .xlsx spreadsheets, and pulldown-cmark for Markdown.
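The format-to-parser mapping above can be expressed as a dispatch on file extension. The sketch below only selects a crate by name and does not call any parser API; the `DocKind` type and its methods are hypothetical, and treating `.markdown` as Markdown is an added assumption.

```rust
use std::path::Path;

/// Document formats named in this article.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DocKind {
    Pdf,
    Docx,
    Xlsx,
    Markdown,
}

impl DocKind {
    /// Chooses a format from the file extension, case-insensitively.
    /// Returns None for unknown or missing extensions.
    pub fn from_path(path: &Path) -> Option<DocKind> {
        let ext = path.extension()?.to_str()?.to_ascii_lowercase();
        match ext.as_str() {
            "pdf" => Some(DocKind::Pdf),
            "docx" => Some(DocKind::Docx),
            "xlsx" => Some(DocKind::Xlsx),
            "md" | "markdown" => Some(DocKind::Markdown),
            _ => None,
        }
    }

    /// Name of the crate the article assigns to this format.
    pub fn parser_crate(self) -> &'static str {
        match self {
            DocKind::Pdf => "oxidize-pdf",
            DocKind::Docx => "docx-rust",
            DocKind::Xlsx => "calamine",
            DocKind::Markdown => "pulldown-cmark",
        }
    }
}
```

Returning `None` for unrecognised extensions lets the caller reject a file explicitly instead of guessing a parser.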
mupdf-rs is explicitly excluded from the dependency graph. It carries an AGPL-3.0 license, which would taint the binary if linked. The cargo-deny CI policy enforces this exclusion automatically on every commit.
Orchestration
Internal job orchestration uses apalis (MIT), a job-processing library with step-based workflow composition and tower middleware compatibility. apalis fits the service-slm work shape: sanitise, send, await, receive, rehydrate. It introduces no Python runtime dependency, which is the meaningful distinction from Python-native workflow engines used in the `service-content` derivative pipeline.
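The work shape can be sketched with plain functions. This example does not use the apalis API; it assumes, for illustration only, that sanitising means swapping sensitive terms for numbered placeholders and that rehydrating reverses the swap. The function names, the placeholder format and the `send` closure (which stands in for the send, await and receive steps) are all hypothetical.

```rust
use std::collections::HashMap;

/// Placeholder-to-original mapping produced by `sanitise`.
pub type Redactions = HashMap<String, String>;

/// Replaces each sensitive term found in the prompt with a numbered
/// placeholder and records how to undo the replacement.
pub fn sanitise(prompt: &str, sensitive: &[&str]) -> (String, Redactions) {
    let mut text = prompt.to_string();
    let mut map = Redactions::new();
    for (i, &term) in sensitive.iter().enumerate() {
        if term.is_empty() || !text.contains(term) {
            continue;
        }
        let placeholder = format!("[[REDACTED_{}]]", i);
        text = text.replace(term, &placeholder);
        map.insert(placeholder, term.to_string());
    }
    (text, map)
}

/// Restores the original terms in a reply that still carries placeholders.
pub fn rehydrate(reply: &str, map: &Redactions) -> String {
    let mut text = reply.to_string();
    for (placeholder, original) in map {
        text = text.replace(placeholder.as_str(), original);
    }
    text
}

/// Runs the steps in order. `send` receives only the sanitised prompt.
pub fn run_job<F>(prompt: &str, sensitive: &[&str], send: F) -> String
where
    F: FnOnce(&str) -> String,
{
    let (clean, map) = sanitise(prompt, sensitive);
    let reply = send(&clean);
    rehydrate(&reply, &map)
}
```

Because the redaction map never leaves `run_job`, the remote side sees only placeholders, and each function maps naturally onto one step of a step-based workflow.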
Observability and supply-chain security
Structured logging and distributed tracing are provided by the tracing crate family (MIT), with opentelemetry-rust for OpenTelemetry export to the SOC 3 audit pipeline. Container image signing and OCI artefact attestation use sigstore-rs (Apache-2.0).
cargo-deny runs in CI on every commit and enforces license policy across the full transitive dependency tree, blocking any new dependency that introduces AGPL, GPL, LGPL, BSL, or custom community licenses. This turns manual license discipline into an automated build constraint.
Flat binary architecture
The cargo workspace produces one binary: slm-cli. Logical modules communicate via Rust function calls, not RPC. External calls (to Cloud Run, to the Mooncake KV cache sidecar, to Tier C API endpoints, and to LadybugDB) are the only network boundaries.
service-slm/
├── crates/
│   ├── slm-core/               shared types, moduleId discipline
│   ├── slm-doorman/            sanitise / send / receive / rehydrate
│   ├── slm-ledger/             append-only audit trail (SQLite + CSV)
│   ├── slm-compute/            Cloud Run driver, container management
│   ├── slm-memory-kv/          LMCache + Mooncake wire protocol client
│   ├── slm-memory-adapters/    LoRA adapter registry and loader
│   ├── slm-inference-local/    mistral.rs-backed local inference
│   ├── slm-inference-remote/   GPU burst driver
│   ├── slm-api/                axum inbound endpoints
│   └── slm-cli/                binary entry point
└── xtask/                      build helpers, release automation
This is the shape a ToteboxOS appliance component requires: one process, one log stream, one set of metrics, one binary to sign with Sigstore, one configuration file.
ToteboxOS integration
The binary architecture is motivated in part by ToteboxOS deployment constraints. A CPython stack plus a GPU inference framework does not fit in the memory envelope available on constrained appliance hardware. A Rust binary with a quantised inference runtime operating in CPU mode does.
The properties that matter on the ToteboxOS Laptop-A hardware profile (~550 MB available headroom after core services):
- Static binary, no interpreter warmup: seconds, not minutes, to first inference
- No garbage collector, no interpreter heap
- True parallelism across cores without a global interpreter lock
- Cross-compilation via `cargo build --target aarch64-unknown-linux-gnu` for ARM ToteboxOS targets
Three external non-Rust services
Three services in the Yo-Yo compute substrate sit outside the Rust binary, all behind stable network protocols:
- LMCache + Mooncake Store (Python control plane + C++ Mooncake Transfer Engine): the KV cache tier that persists prefill state across GPU node teardowns. service-slm holds a Rust client that speaks to Mooncake over HTTP and TCP. No FFI coupling. Both are Apache-2.0 licensed.
- vLLM (Python): the Phase 1 trial inference engine. Replaced by mistral.rs in Phase 2. Apache-2.0.
- SkyPilot (Python): multi-cloud GPU orchestration. Used when Cloud Run GPU alone is insufficient. Apache-2.0.
Because service-slm depends on the wire protocol and not the implementation, swapping any of the three requires changing one client module.
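One way to keep a swap confined to a single client module is to have the rest of the binary depend on a trait. The sketch below is an assumption about structure, not the slm-memory-kv API: the `KvCacheClient` trait, the in-memory implementation and `prefill_or_reuse` are hypothetical, and the "prefill state" is a placeholder byte vector.

```rust
use std::collections::HashMap;

/// Hypothetical interface the rest of the binary would depend on.
/// A real implementation would speak the cache's wire protocol.
pub trait KvCacheClient {
    fn put(&mut self, key: &str, value: Vec<u8>);
    fn get(&self, key: &str) -> Option<Vec<u8>>;
}

/// In-memory stand-in, useful for tests or as a fallback.
#[derive(Default)]
pub struct InMemoryKv {
    blocks: HashMap<String, Vec<u8>>,
}

impl KvCacheClient for InMemoryKv {
    fn put(&mut self, key: &str, value: Vec<u8>) {
        self.blocks.insert(key.to_string(), value);
    }

    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.blocks.get(key).cloned()
    }
}

/// Caller code is written against the trait only. Returns the cached
/// state and whether it was a cache hit.
pub fn prefill_or_reuse<C: KvCacheClient>(cache: &mut C, prompt_prefix: &str) -> (Vec<u8>, bool) {
    if let Some(hit) = cache.get(prompt_prefix) {
        return (hit, true);
    }
    // Placeholder for real prefill work: the prefix bytes stand in for KV state.
    let state = prompt_prefix.as_bytes().to_vec();
    cache.put(prompt_prefix, state.clone());
    (state, false)
}
```

Replacing the backing service then means adding another `impl KvCacheClient` and changing which type is constructed at startup; `prefill_or_reuse` and its callers stay untouched.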
License hygiene
The mandatory rule for the entire dependency graph: every entry is one of MIT, Apache-2.0, BSD-2-Clause, BSD-3-Clause, ISC, Unicode-DFS, MPL-2.0 (file-level), or Zlib. Anything else fails the CI build.
cargo-deny with a deny.toml policy file enforces this. The deny.toml is committed to the repository and reviewed at every dependency addition.
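The allowlist rule can be illustrated with a simplified check over license expressions. This sketch is not how cargo-deny is implemented: it handles only flat `OR` and `AND` combinations, ignores parentheses and `WITH` exceptions, and the function name is hypothetical. The identifiers are copied as written in this article; the exact SPDX spelling used in a real deny.toml may differ (for example, the Unicode license carries a version suffix in SPDX).

```rust
/// The allowlist stated in this article, spelled as it appears there.
const ALLOWED: [&str; 8] = [
    "MIT",
    "Apache-2.0",
    "BSD-2-Clause",
    "BSD-3-Clause",
    "ISC",
    "Unicode-DFS",
    "MPL-2.0",
    "Zlib",
];

/// Simplified evaluation of an SPDX-style expression:
/// `A OR B` passes when any alternative passes,
/// `A AND B` passes only when every term is allowed.
pub fn license_allowed(expr: &str) -> bool {
    expr.split(" OR ").any(|alternative| {
        alternative
            .split(" AND ")
            .all(|id| ALLOWED.contains(&id.trim()))
    })
}
```

Under this rule a dual-licensed crate such as `MIT OR Apache-2.0` passes, a crate offering a permissive alternative alongside a copyleft one also passes because the permissive option can be chosen, and anything that requires a license outside the list fails.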
See also
- compounding-doorman: the operational pattern service-slm implements
- yoyo-compute-substrate: the multi-ring compute substrate service-slm drives
- apprenticeship-substrate: how the audit ledger feeds LoRA adapter training
- three-ring-architecture: Ring 3 placement of service-slm in the platform