Diff: substrate/yo-yo-lora-training-pipeline.es

From a16a78a to a16a78a

+0 / −0 lines

Before	After
---	---
schema: foundry-doc-v1	schema: foundry-doc-v1
title: "Yo-yo #1 nightly LoRA training pipeline"	title: "Yo-yo #1 nightly LoRA training pipeline"
slug: yo-yo-lora-training-pipeline	slug: yo-yo-lora-training-pipeline
category: substrate	category: substrate
type: topic	type: topic
quality: complete	quality: complete
short_description: "The nightly two-phase pipeline on Yo-Yo #1: Phase 1 runs entity extraction for the business DataGraph; Phase 2 trains a LoRA adapter against engineering and apprenticeship corpora using QLoRA on a single L4 GPU."	short_description: "The nightly two-phase pipeline on Yo-Yo #1: Phase 1 runs entity extraction for the business DataGraph; Phase 2 trains a LoRA adapter against engineering and apprenticeship corpora using QLoRA on a single L4 GPU."
status: active	status: active
bcsc_class: public-disclosure-safe	bcsc_class: public-disclosure-safe
last_edited: 2026-05-15	last_edited: 2026-05-15
editor: pointsav-engineering	editor: pointsav-engineering
cites: []	cites: []
references:	references:
- id: 1	- id: 1
text: "Dettmers, T. et al. 'QLoRA: Efficient Finetuning of Quantized LLMs.' NeurIPS, 2023."	text: "Dettmers, T. et al. 'QLoRA: Efficient Finetuning of Quantized LLMs.' NeurIPS, 2023."
url: "https://arxiv.org/abs/2305.14314"	url: "https://arxiv.org/abs/2305.14314"
- id: 2	- id: 2
text: "Hu, E. et al. 'LoRA: Low-Rank Adaptation of Large Language Models.' ICLR, 2022."	text: "Hu, E. et al. 'LoRA: Low-Rank Adaptation of Large Language Models.' ICLR, 2022."
url: "https://arxiv.org/abs/2106.09685"	url: "https://arxiv.org/abs/2106.09685"
- id: 3	- id: 3
text: "Rafailov, R. et al. 'Direct Preference Optimization: Your Language Model is Secretly a Reward Model.' NeurIPS, 2023."	text: "Rafailov, R. et al. 'Direct Preference Optimization: Your Language Model is Secretly a Reward Model.' NeurIPS, 2023."
url: "https://arxiv.org/abs/2305.18290"	url: "https://arxiv.org/abs/2305.18290"
paired_with: yo-yo-lora-training-pipeline.es.md	paired_with: yo-yo-lora-training-pipeline.es.md
---	---

Yo-Yo #1 is a g2-standard-4 Google Cloud spot instance equipped with a	Yo-Yo #1 is a g2-standard-4 Google Cloud spot instance equipped with a
single NVIDIA L4 GPU (24 GB VRAM). On each nightly run, it executes a	single NVIDIA L4 GPU (24 GB VRAM). On each nightly run, it executes a
two-phase, four-hour pipeline that produces fine-tuned adapter weights for	two-phase, four-hour pipeline that produces fine-tuned adapter weights for
the workspace language model. Phase 1 extracts structured business entities	the workspace language model. Phase 1 extracts structured business entities
from the deployment data corpus and writes them to a property graph. Phase 2	from the deployment data corpus and writes them to a property graph. Phase 2
reads accumulated engineering and apprenticeship training tuples, checks	reads accumulated engineering and apprenticeship training tuples, checks
whether the corpus has crossed a minimum threshold, and runs a	whether the corpus has crossed a minimum threshold, and runs a
parameter-efficient training pass against the base model. The two phases	parameter-efficient training pass against the base model. The two phases
are mandatory and sequential — they cannot overlap because both require	are mandatory and sequential — they cannot overlap because both require
exclusive access to the L4 GPU.	exclusive access to the L4 GPU.

## Why the Phases Are Separate	## Why the Phases Are Separate

The L4 GPU serves two incompatible workloads within the nightly window.	The L4 GPU serves two incompatible workloads within the nightly window.
During Phase 1, vLLM loads OLMo 3 32B Think (4-bit quantised) to run	During Phase 1, vLLM loads OLMo 3 32B Think (4-bit quantised) to run
entity extraction inference. During Phase 2, the QLoRA training loop loads	entity extraction inference. During Phase 2, the QLoRA training loop loads
OLMo 3 7B Think safetensors for gradient computation. A GPU cannot serve	OLMo 3 7B Think safetensors for gradient computation. A GPU cannot serve
an active vLLM inference process and a PyTorch training loop simultaneously	an active vLLM inference process and a PyTorch training loop simultaneously
— memory addresses conflict and context switching between CUDA kernels at	— memory addresses conflict and context switching between CUDA kernels at
this scale is not supported. `nightly-run.sh` enforces the boundary	this scale is not supported. `nightly-run.sh` enforces the boundary
explicitly: Phase 1 ends with `stop-yoyo.sh`, which drains the vLLM	explicitly: Phase 1 ends with `stop-yoyo.sh`, which drains the vLLM
process and frees the GPU before Phase 2 begins. Each phase has a	process and frees the GPU before Phase 2 begins. Each phase has a
configurable budget, defaulting to 7200 seconds.	configurable budget, defaulting to 7200 seconds.

## Phase 1 — DataGraph Rebuild	## Phase 1 — DataGraph Rebuild

At the start of the nightly window, `start-yoyo.sh` boots the Yo-Yo #1	At the start of the nightly window, `start-yoyo.sh` boots the Yo-Yo #1
VM and waits up to 90 minutes for vLLM to signal readiness. Once the	VM and waits up to 90 minutes for vLLM to signal readiness. Once the
inference server is live, `nightly-datagraph-rebuild.sh` processes three	inference server is live, `nightly-datagraph-rebuild.sh` processes three
document streams from the deployment: meeting transcript markdown files,	document streams from the deployment: meeting transcript markdown files,
agent research YAML and markdown files, and contact source JSON records. For each document, the script calls	agent research YAML and markdown files, and contact source JSON records. For each document, the script calls
`POST :9080/v1/chat/completions` through the Doorman, which routes the	`POST :9080/v1/chat/completions` through the Doorman, which routes the
payload to the 32B Think model on the Yo-Yo VM. The model returns a	payload to the 32B Think model on the Yo-Yo VM. The model returns a
structured JSON array of named entities — people, companies, projects,	structured JSON array of named entities — people, companies, projects,
accounts, and locations — constrained by a JSON Schema grammar so the	accounts, and locations — constrained by a JSON Schema grammar so the
output is machine-parseable without post-processing. The script then calls	output is machine-parseable without post-processing. The script then calls
`POST :9081/v1/graph/mutate` on service-content to write those entities	`POST :9081/v1/graph/mutate` on service-content to write those entities
into LadybugDB. A local ledger of processed document hashes ensures each	into LadybugDB. A local ledger of processed document hashes ensures each
document is processed exactly once across multiple nightly runs.	document is processed exactly once across multiple nightly runs.

At the end of Phase 1, vLLM stops and the GPU is released.	At the end of Phase 1, vLLM stops and the GPU is released.

## Phase 2 — Adapter Training	## Phase 2 — Adapter Training

`corpus-threshold.py` runs at the start of Phase 2. It counts JSONL tuples	`corpus-threshold.py` runs at the start of Phase 2. It counts JSONL tuples
in two corpus buckets — `engineering-pointsav` (SFT tuples drawn from	in two corpus buckets — `engineering-pointsav` (SFT tuples drawn from
cross-cluster engineering commits) and `apprenticeship-pointsav` (DPO pairs	cross-cluster engineering commits) and `apprenticeship-pointsav` (DPO pairs
drawn from the apprenticeship routing substrate). When either bucket reaches	drawn from the apprenticeship routing substrate). When either bucket reaches
50 tuples, the script writes a training-pending marker file and, if the	50 tuples, the script writes a training-pending marker file and, if the
`SLM_YOYO_WEIGHTS_GCS_BUCKET` environment variable is set, syncs the	`SLM_YOYO_WEIGHTS_GCS_BUCKET` environment variable is set, syncs the
relevant corpus directory to the configured GCS bucket.	relevant corpus directory to the configured GCS bucket.

On the Yo-Yo VM, `lora-training.sh` polls the training-pending directory	On the Yo-Yo VM, `lora-training.sh` polls the training-pending directory
every 30 seconds. When a marker appears, it claims the marker with an	every 30 seconds. When a marker appears, it claims the marker with an
atomic rename (appending `.claimed`), pulls the corpus from GCS, and runs	atomic rename (appending `.claimed`), pulls the corpus from GCS, and runs
QLoRA using the peft, bitsandbytes, and trl libraries.	QLoRA using the peft, bitsandbytes, and trl libraries.

## What QLoRA Is	## What QLoRA Is

QLoRA (Quantised Low-Rank Adaptation) is a parameter-efficient fine-tuning	QLoRA (Quantised Low-Rank Adaptation) is a parameter-efficient fine-tuning
method that loads a base model in 4-bit NF4 quantisation and trains a small	method that loads a base model in 4-bit NF4 quantisation and trains a small
set of additional weight matrices — called an adapter — rather than updating	set of additional weight matrices — called an adapter — rather than updating
the full model. [^1] For a 7B-parameter model like OLMo 3 7B Think, 4-bit	the full model. [^1] For a 7B-parameter model like OLMo 3 7B Think, 4-bit
quantisation reduces the GPU footprint from roughly 14 GB (in bfloat16) to	quantisation reduces the GPU footprint from roughly 14 GB (in bfloat16) to
approximately 6 GB, leaving adequate headroom on the 24 GB L4 for the	approximately 6 GB, leaving adequate headroom on the 24 GB L4 for the
training loop itself. The adapter targets seven linear projection layers:	training loop itself. The adapter targets seven linear projection layers:
`q_proj`, `v_proj`, `k_proj`, `o_proj`, `gate_proj`, `up_proj`, and	`q_proj`, `v_proj`, `k_proj`, `o_proj`, `gate_proj`, `up_proj`, and
`down_proj`. Training runs for two epochs with rank 16 (`r=16`), alpha 32	`down_proj`. Training runs for two epochs with rank 16 (`r=16`), alpha 32
(`lora_alpha=32`), a maximum sequence length of 512 tokens, and gradient	(`lora_alpha=32`), a maximum sequence length of 512 tokens, and gradient
checkpointing enabled to manage activation memory. [^2]	checkpointing enabled to manage activation memory. [^2]

The training configuration is intentionally conservative. The goal is to	The training configuration is intentionally conservative. The goal is to
shift the base model toward the vocabulary, formatting patterns, and	shift the base model toward the vocabulary, formatting patterns, and
structural conventions that appear in the engineering and apprenticeship	structural conventions that appear in the engineering and apprenticeship
corpora — not to retrain the model on a general task. Two epochs over	corpora — not to retrain the model on a general task. Two epochs over
hundreds of tuples is sufficient for this narrow shift.	hundreds of tuples is sufficient for this narrow shift.

## The Two Corpus Streams	## The Two Corpus Streams

Engineering tuples are SFT (supervised fine-tuning) pairs drawn from	Engineering tuples are SFT (supervised fine-tuning) pairs drawn from
actual commit diffs, commit messages, and review briefs across all clusters	actual commit diffs, commit messages, and review briefs across all clusters
in the workspace. They teach the model the precise technical vocabulary and	in the workspace. They teach the model the precise technical vocabulary and
structural patterns used in the engineering workflow: how diffs are	structural patterns used in the engineering workflow: how diffs are
described, how review comments are phrased, and how implementation decisions	described, how review comments are phrased, and how implementation decisions
are documented.	are documented.

Apprenticeship tuples are DPO (direct preference optimisation) pairs	Apprenticeship tuples are DPO (direct preference optimisation) pairs
produced by the apprenticeship routing substrate. Each pair consists of a	produced by the apprenticeship routing substrate. Each pair consists of a
shadow response (the model's unguided output) and a verdict response (the	shadow response (the model's unguided output) and a verdict response (the
preferred formulation confirmed by the operator). DPO training on these	preferred formulation confirmed by the operator). DPO training on these
pairs moves the model toward the preferred response distribution without	pairs moves the model toward the preferred response distribution without
requiring explicit labels for every token. [^3]	requiring explicit labels for every token. [^3]

## Adapter Output and Publication	## Adapter Output and Publication

When training completes, the adapter is saved to	When training completes, the adapter is saved to
`/data/weights/adapters/<tenant>/<role>/v<N>/` on the Yo-Yo VM. The adapter	`/data/weights/adapters/<tenant>/<role>/v<N>/` on the Yo-Yo VM. The adapter
directory contains the LoRA weight files and tokenizer configuration — total	directory contains the LoRA weight files and tokenizer configuration — total
size is typically 1 to 3 GB. `lora-training.sh` then signals	size is typically 1 to 3 GB. `lora-training.sh` then signals
`adapter-publish.service`, which uploads the adapter directory to the	`adapter-publish.service`, which uploads the adapter directory to the
configured GCS bucket. The adapter is subsequently available to the	configured GCS bucket. The adapter is subsequently available to the
workspace Doorman for loading as an inference-time weight overlay on the	workspace Doorman for loading as an inference-time weight overlay on the
base model. The marker file is renamed to `.completed` when all steps	base model. The marker file is renamed to `.completed` when all steps
succeed.	succeed.

## Adapter Training Versus Continued Pre-Training	## Adapter Training Versus Continued Pre-Training

The nightly LoRA process is adapter training. It produces a weight delta —	The nightly LoRA process is adapter training. It produces a weight delta —
a few gigabytes of parameters — that the base model loads at inference time.	a few gigabytes of parameters — that the base model loads at inference time.
It runs in approximately two hours on a single L4 GPU and operates over	It runs in approximately two hours on a single L4 GPU and operates over
hundreds to low thousands of training tuples. The base model itself is not	hundreds to low thousands of training tuples. The base model itself is not
modified.	modified.

Continued pre-training (CPT) is a distinct operation at a fundamentally	Continued pre-training (CPT) is a distinct operation at a fundamentally
different scale. CPT would produce a new base model checkpoint by training	different scale. CPT would produce a new base model checkpoint by training
on 50 to 100 billion tokens across 8 to 32 H100-class GPUs for one to four	on 50 to 100 billion tokens across 8 to 32 H100-class GPUs for one to four
weeks. The cost per CPT cycle runs to tens of thousands of dollars. CPT is	weeks. The cost per CPT cycle runs to tens of thousands of dollars. CPT is
operator-triggered, never automated, and never scheduled as part of the	operator-triggered, never automated, and never scheduled as part of the
nightly pipeline. The first-cut CPT target is Q1 2027, contingent on	nightly pipeline. The first-cut CPT target is Q1 2027, contingent on
corpus volume and operator decision. Until that decision is made, all	corpus volume and operator decision. Until that decision is made, all
nightly training is adapter-only.	nightly training is adapter-only.

## Current Status	## Current Status

The nightly pipeline code is complete. The workspace language model service	The nightly pipeline code is complete. The workspace language model service
passes 177 of 177 tests. The Packer image rebuild that bakes the training	passes 177 of 177 tests. The Packer image rebuild that bakes the training
Python stack (peft, bitsandbytes, trl) into the Yo-Yo VM is the next	Python stack (peft, bitsandbytes, trl) into the Yo-Yo VM is the next
intended operator action. Once that image is deployed, `lora-training.service`	intended operator action. Once that image is deployed, `lora-training.service`
on the Yo-Yo VM is intended to be enabled with	on the Yo-Yo VM is intended to be enabled with
`systemctl enable --now lora-training.service`. Until the image is rebuilt,	`systemctl enable --now lora-training.service`. Until the image is rebuilt,
the training phase runs in marker-only mode: `corpus-threshold.py` writes	the training phase runs in marker-only mode: `corpus-threshold.py` writes
and dispatches the GCS marker, but `lora-training.sh` is not yet active	and dispatches the GCS marker, but `lora-training.sh` is not yet active
on the runtime VM image.	on the runtime VM image.