Diff: architecture/elastic-compute-lora-training-pipeline

From b67c6ba to b67c6ba

+153 / −0 lines

Before	After
	---
	schema: foundry-doc-v1
	title: "Elastic Compute #1 nightly LoRA training pipeline"
	slug: elastic-compute-lora-training-pipeline
	category: architecture
	type: concept
	quality: pre-build
	status: pre-build
	audience: vendor-public
	bcsc_class: current-fact
	language_protocol: PROSE-TOPIC
	last_edited: 2026-05-24
	editor: pointsav-engineering
	paired_with: elastic-compute-lora-training-pipeline.es.md
	short_description: "Elastic Compute #1 runs a nightly two-phase pipeline that rebuilds the deployment DataGraph and produces LoRA adapter weights for the workspace language model. Phase 1 uses a 32B inference model for entity extraction; Phase 2 trains a 7B parameter-efficient adapter from engineering and apprenticeship corpora."
	cites: []
	---

	Elastic Compute #1 is a g2-standard-4 Google Cloud spot instance equipped with a
	single NVIDIA L4 GPU (24 GB VRAM). Each night it runs a two-phase, four-hour
	pipeline that produces fine-tuned adapter weights for the workspace language
	model. Phase 1 extracts structured business entities from the operator data
	corpus and writes them to a property graph. Phase 2 reads accumulated
	engineering and apprenticeship training tuples, checks whether the corpus
	has crossed a minimum threshold, and runs a parameter-efficient training
	pass against the base model. The two phases are mandatory and sequential —
	they cannot overlap because both require exclusive access to the L4 GPU.

	## Why the phases are separate

	The L4 GPU serves two incompatible workloads within the nightly window.
	During Phase 1, vLLM loads OLMo 3 32B Think (4-bit quantised) to run entity
	extraction inference. During Phase 2, the QLoRA training loop loads OLMo 3
	7B Think safetensors for gradient computation. A GPU cannot serve an active
	vLLM inference process and a PyTorch training loop simultaneously — memory
	addresses conflict and context switching between CUDA kernels at this scale
	is not supported. `nightly-run.sh` enforces the boundary explicitly:
	Phase 1 ends with `stop-yoyo.sh`, which drains the vLLM process and frees
	the GPU before Phase 2 begins. Each phase has a configurable budget,
	defaulting to 7200 seconds (two hours) each.

	## Phase 1 — DataGraph rebuild

	At the start of the nightly window, `start-yoyo.sh` boots the Elastic Compute #1 VM
	and waits up to 90 minutes for vLLM to signal readiness. Once the inference
	server is live, `jennifer-datagraph-rebuild.sh` processes three document
	streams from the operator deployment: meeting transcript markdown files,
	agent research YAML and markdown files, and contact source JSON records.
	For each document, the script calls `POST :9080/v1/chat/completions` through
	the Doorman, which routes the payload to the 32B Think model on the Elastic Compute VM.
	The model returns a structured JSON array of named entities — people,
	companies, projects, accounts, and locations — constrained by a JSON Schema
	grammar so the output is machine-parseable without post-processing. The
	script then calls `POST :9081/v1/graph/mutate` on service-content to write
	those entities into LadybugDB. A local ledger of processed document hashes
	ensures each document is processed exactly once across multiple nightly runs.

	At the end of Phase 1, vLLM stops and the GPU is released.

	## Phase 2 — Adapter training

	`corpus-threshold.py` runs at the start of Phase 2. It counts JSONL tuples
	in two corpus buckets — `engineering-pointsav` (SFT tuples drawn from
	cross-cluster engineering commits) and `apprenticeship-pointsav` (DPO pairs
	drawn from the apprenticeship routing substrate). When either bucket reaches
	50 tuples, the script writes a training-pending marker file and, if the
	`SLM_YOYO_WEIGHTS_GCS_BUCKET` environment variable is set, syncs the
	relevant corpus directory to the configured GCS bucket.

	On the Elastic Compute VM, `lora-training.sh` polls the training-pending directory
	every 30 seconds. When a marker appears, it claims the marker with an atomic
	rename (appending `.claimed`), pulls the corpus from GCS, and runs QLoRA
	using the peft, bitsandbytes, and trl libraries.

	## What QLoRA is

	QLoRA (Quantised Low-Rank Adaptation) is a parameter-efficient fine-tuning
	method that loads a base model in 4-bit NF4 quantisation and trains a small
	set of additional weight matrices — called an adapter — rather than updating
	the full model. For a 7B-parameter model like OLMo 3 7B Think, 4-bit
	quantisation reduces the GPU footprint from roughly 14 GB (in bfloat16) to
	approximately 6 GB, leaving adequate headroom on the 24 GB L4 for the
	training loop itself. The adapter targets seven linear projection layers:
	`q_proj`, `v_proj`, `k_proj`, `o_proj`, `gate_proj`, `up_proj`, and
	`down_proj`. Training runs for two epochs with rank 16 (`r=16`), alpha 32
	(`lora_alpha=32`), a maximum sequence length of 512 tokens, and gradient
	checkpointing enabled to manage activation memory.

	The training configuration is intentionally conservative. The goal is to
	shift the base model toward the vocabulary, formatting patterns, and
	structural conventions that appear in the engineering and apprenticeship
	corpora — not to retrain the model on a general task. Two epochs over
	hundreds of tuples is sufficient for this narrow shift.

	## The two corpus streams

	Engineering tuples are SFT (supervised fine-tuning) pairs drawn from
	actual commit diffs, commit messages, and review briefs across all clusters
	in the workspace. They teach the model the precise technical vocabulary and
	structural patterns used in the engineering workflow: how diffs are described,
	how review comments are phrased, and how implementation decisions are
	documented.

	Apprenticeship tuples are DPO (direct preference optimisation) pairs
	produced by the apprenticeship routing substrate. Each pair consists of a
	shadow response (the model's unguided output) and a verdict response (the
	preferred formulation confirmed by the operator). DPO training on these pairs
	moves the model toward the preferred response distribution without requiring
	explicit labels for every token.

	## Adapter output and publication

	When training completes, the adapter is saved to
	`/data/weights/adapters/<tenant>/<role>/v<N>/` on the Elastic Compute VM. The adapter
	directory contains the LoRA weight files and tokenizer configuration — total
	size is typically 1 to 3 GB. `lora-training.sh` then signals
	`adapter-publish.service`, which uploads the adapter directory to the
	configured GCS bucket. The adapter is subsequently available to the workspace
	Doorman for loading as an inference-time weight overlay on the base model.
	The marker file is renamed to `.completed` when all steps succeed.

	## Adapter training versus continued pre-training

	The nightly LoRA process is adapter training. It produces a weight delta —
	a few gigabytes of parameters — that the base model loads at inference time.
	It runs in approximately two hours on a single L4 GPU and operates over
	hundreds to low thousands of training tuples. The base model itself is not
	modified.

	Continued pre-training (CPT) is a distinct operation at a fundamentally
	different scale. CPT would produce a new base model checkpoint by training
	on 50 to 100 billion tokens across 8 to 32 H100-class GPUs for one to four
	weeks. The cost per CPT cycle runs to tens of thousands of dollars. CPT is
	operator-triggered, never automated, and never scheduled as part of the
	nightly pipeline. The first-cut CPT target is Q1 2027, contingent on
	corpus volume and operator decision. Until that decision is made, all
	nightly training is adapter-only.

	## Current status

	The nightly pipeline code is complete. The workspace language model service
	passes 177 of 177 tests. The Packer image rebuild that bakes the training
	Python stack (peft, bitsandbytes, trl) into the Elastic Compute VM is the next
	intended operator action. Once that image is deployed, `lora-training.service`
	on the Elastic Compute VM will be enabled with `systemctl enable --now lora-training.service`.
	Until the image is rebuilt, the training phase runs in marker-only mode:
	`corpus-threshold.py` writes and dispatches the GCS marker, but
	`lora-training.sh` is not yet active on the runtime VM image.

	## See also

	- [[service-slm-graph-store-migration]] — the DataGraph rebuild that forms Phase 1 of this pipeline
	- [[service-slm]] — the service that orchestrates both phases