Diff: substrate/yo-yo-lora-training-pipeline
From 9fca6cd to 9fca6cd
+0 / −0 lines
| Before | After |
|---|---|
| --- | --- |
| schema: foundry-doc-v1 | schema: foundry-doc-v1 |
| title: "Yo-Yo #1 nightly LoRA training pipeline" | title: "Yo-Yo #1 nightly LoRA training pipeline" |
| slug: yo-yo-lora-training-pipeline | slug: yo-yo-lora-training-pipeline |
| category: substrate | category: substrate |
| type: topic | type: topic |
| quality: complete | quality: complete |
| short_description: "The nightly two-phase pipeline on Yo-Yo #1: Phase 1 runs entity extraction for the business DataGraph; Phase 2 trains a LoRA adapter against engineering and apprenticeship corpora using QLoRA on a single L4 GPU." | short_description: "The nightly two-phase pipeline on Yo-Yo #1: Phase 1 runs entity extraction for the business DataGraph; Phase 2 trains a LoRA adapter against engineering and apprenticeship corpora using QLoRA on a single L4 GPU." |
| status: active | status: active |
| bcsc_class: public-disclosure-safe | bcsc_class: public-disclosure-safe |
| last_edited: 2026-05-15 | last_edited: 2026-05-15 |
| editor: pointsav-engineering | editor: pointsav-engineering |
| cites: [] | cites: [] |
| references: | references: |
| - id: 1 | - id: 1 |
| text: "Dettmers, T. et al. 'QLoRA: Efficient Finetuning of Quantized LLMs.' NeurIPS, 2023." | text: "Dettmers, T. et al. 'QLoRA: Efficient Finetuning of Quantized LLMs.' NeurIPS, 2023." |
| url: "https://arxiv.org/abs/2305.14314" | url: "https://arxiv.org/abs/2305.14314" |
| - id: 2 | - id: 2 |
| text: "Hu, E. et al. 'LoRA: Low-Rank Adaptation of Large Language Models.' ICLR, 2022." | text: "Hu, E. et al. 'LoRA: Low-Rank Adaptation of Large Language Models.' ICLR, 2022." |
| url: "https://arxiv.org/abs/2106.09685" | url: "https://arxiv.org/abs/2106.09685" |
| - id: 3 | - id: 3 |
| text: "Rafailov, R. et al. 'Direct Preference Optimization: Your Language Model is Secretly a Reward Model.' NeurIPS, 2023." | text: "Rafailov, R. et al. 'Direct Preference Optimization: Your Language Model is Secretly a Reward Model.' NeurIPS, 2023." |
| url: "https://arxiv.org/abs/2305.18290" | url: "https://arxiv.org/abs/2305.18290" |
| paired_with: yo-yo-lora-training-pipeline.es.md | paired_with: yo-yo-lora-training-pipeline.es.md |
| --- | --- |
| Yo-Yo #1 is a g2-standard-4 Google Cloud spot instance equipped with a | Yo-Yo #1 is a g2-standard-4 Google Cloud spot instance equipped with a |
| single NVIDIA L4 GPU (24 GB VRAM). On each nightly run, it executes a | single NVIDIA L4 GPU (24 GB VRAM). On each nightly run, it executes a |
| two-phase, four-hour pipeline that produces fine-tuned adapter weights for | two-phase, four-hour pipeline that produces fine-tuned adapter weights for |
| the workspace language model. Phase 1 extracts structured business entities | the workspace language model. Phase 1 extracts structured business entities |
| from the deployment data corpus and writes them to a property graph. Phase 2 | from the deployment data corpus and writes them to a property graph. Phase 2 |
| reads accumulated engineering and apprenticeship training tuples, checks | reads accumulated engineering and apprenticeship training tuples, checks |
| whether the corpus has crossed a minimum threshold, and runs a | whether the corpus has crossed a minimum threshold, and runs a |
| parameter-efficient training pass against the base model. The two phases | parameter-efficient training pass against the base model. The two phases |
| are mandatory and sequential — they cannot overlap because both require | are mandatory and sequential — they cannot overlap because both require |
| exclusive access to the L4 GPU. | exclusive access to the L4 GPU. |
| ## Why the Phases Are Separate | ## Why the Phases Are Separate |
| The L4 GPU serves two incompatible workloads within the nightly window. | The L4 GPU serves two incompatible workloads within the nightly window. |
| During Phase 1, vLLM loads OLMo 3 32B Think (4-bit quantised) to run | During Phase 1, vLLM loads OLMo 3 32B Think (4-bit quantised) to run |
| entity extraction inference. During Phase 2, the QLoRA training loop loads | entity extraction inference. During Phase 2, the QLoRA training loop loads |
| OLMo 3 7B Think safetensors for gradient computation. The 24 GB L4 cannot | OLMo 3 7B Think safetensors for gradient computation. The 24 GB L4 cannot |
| hold an active vLLM inference process and a PyTorch training loop at the | hold an active vLLM inference process and a PyTorch training loop at the |
| same time: the quantised 32B weights and vLLM's KV cache occupy most of the | same time: the quantised 32B weights and vLLM's KV cache occupy most of the |
| card, leaving too little memory for training. `nightly-run.sh` enforces the boundary | card, leaving too little memory for training. `nightly-run.sh` enforces the boundary |
| explicitly: Phase 1 ends with `stop-yoyo.sh`, which drains the vLLM | explicitly: Phase 1 ends with `stop-yoyo.sh`, which drains the vLLM |
| process and frees the GPU before Phase 2 begins. Each phase has a | process and frees the GPU before Phase 2 begins. Each phase has a |
| configurable budget, defaulting to 7200 seconds (two hours). | configurable budget, defaulting to 7200 seconds (two hours). |
| ## Phase 1 — DataGraph Rebuild | ## Phase 1 — DataGraph Rebuild |
| At the start of the nightly window, `start-yoyo.sh` boots the Yo-Yo #1 | At the start of the nightly window, `start-yoyo.sh` boots the Yo-Yo #1 |
| VM and waits up to 90 minutes for vLLM to signal readiness. Once the | VM and waits up to 90 minutes for vLLM to signal readiness. Once the |
| inference server is live, `nightly-datagraph-rebuild.sh` processes three | inference server is live, `nightly-datagraph-rebuild.sh` processes three |
| document streams from the deployment: meeting transcript markdown files, | document streams from the deployment: meeting transcript markdown files, |
| agent research YAML and markdown files, and contact source JSON records. For each document, the script calls | agent research YAML and markdown files, and contact source JSON records. For each document, the script calls |
| `POST :9080/v1/chat/completions` through the Doorman, which routes the | `POST :9080/v1/chat/completions` through the Doorman, which routes the |
| payload to the 32B Think model on the Yo-Yo VM. The model returns a | payload to the 32B Think model on the Yo-Yo VM. The model returns a |
| structured JSON array of named entities — people, companies, projects, | structured JSON array of named entities — people, companies, projects, |
| accounts, and locations — constrained by a JSON Schema grammar so the | accounts, and locations — constrained by a JSON Schema grammar so the |
| output is machine-parseable without post-processing. The script then calls | output is machine-parseable without post-processing. The script then calls |
| `POST :9081/v1/graph/mutate` on service-content to write those entities | `POST :9081/v1/graph/mutate` on service-content to write those entities |
| into LadybugDB. A local ledger of processed document hashes ensures each | into LadybugDB. A local ledger of processed document hashes ensures each |
| document is processed exactly once across multiple nightly runs. | document is processed exactly once across multiple nightly runs. |
| At the end of Phase 1, vLLM stops and the GPU is released. | At the end of Phase 1, vLLM stops and the GPU is released. |
| ## Phase 2 — Adapter Training | ## Phase 2 — Adapter Training |
| `corpus-threshold.py` runs at the start of Phase 2. It counts JSONL tuples | `corpus-threshold.py` runs at the start of Phase 2. It counts JSONL tuples |
| in two corpus buckets — `engineering-pointsav` (SFT tuples drawn from | in two corpus buckets — `engineering-pointsav` (SFT tuples drawn from |
| cross-cluster engineering commits) and `apprenticeship-pointsav` (DPO pairs | cross-cluster engineering commits) and `apprenticeship-pointsav` (DPO pairs |
| drawn from the apprenticeship routing substrate). When either bucket reaches | drawn from the apprenticeship routing substrate). When either bucket reaches |
| 50 tuples, the script writes a training-pending marker file and, if the | 50 tuples, the script writes a training-pending marker file and, if the |
| `SLM_YOYO_WEIGHTS_GCS_BUCKET` environment variable is set, syncs the | `SLM_YOYO_WEIGHTS_GCS_BUCKET` environment variable is set, syncs the |
| relevant corpus directory to the configured GCS bucket. | relevant corpus directory to the configured GCS bucket. |
| On the Yo-Yo VM, `lora-training.sh` polls the training-pending directory | On the Yo-Yo VM, `lora-training.sh` polls the training-pending directory |
| every 30 seconds. When a marker appears, it claims the marker with an | every 30 seconds. When a marker appears, it claims the marker with an |
| atomic rename (appending `.claimed`), pulls the corpus from GCS, and runs | atomic rename (appending `.claimed`), pulls the corpus from GCS, and runs |
| QLoRA using the peft, bitsandbytes, and trl libraries. | QLoRA using the peft, bitsandbytes, and trl libraries. |
| ## What QLoRA Is | ## What QLoRA Is |
| QLoRA (Quantised Low-Rank Adaptation) is a parameter-efficient fine-tuning | QLoRA (Quantised Low-Rank Adaptation) is a parameter-efficient fine-tuning |
| method that loads a base model in 4-bit NF4 quantisation and trains a small | method that loads a base model in 4-bit NF4 quantisation and trains a small |
| set of additional weight matrices — called an adapter — rather than updating | set of additional weight matrices — called an adapter — rather than updating |
| the full model. [^1] For a 7B-parameter model like OLMo 3 7B Think, 4-bit | the full model. [^1] For a 7B-parameter model like OLMo 3 7B Think, 4-bit |
| quantisation reduces the GPU footprint from roughly 14 GB (in bfloat16) to | quantisation reduces the GPU footprint from roughly 14 GB (in bfloat16) to |
| approximately 6 GB, leaving adequate headroom on the 24 GB L4 for the | approximately 6 GB, leaving adequate headroom on the 24 GB L4 for the |
| training loop itself. The adapter targets seven linear projection layers: | training loop itself. The adapter targets seven linear projection layers: |
| `q_proj`, `v_proj`, `k_proj`, `o_proj`, `gate_proj`, `up_proj`, and | `q_proj`, `v_proj`, `k_proj`, `o_proj`, `gate_proj`, `up_proj`, and |
| `down_proj`. Training runs for two epochs with rank 16 (`r=16`), alpha 32 | `down_proj`. Training runs for two epochs with rank 16 (`r=16`), alpha 32 |
| (`lora_alpha=32`), a maximum sequence length of 512 tokens, and gradient | (`lora_alpha=32`), a maximum sequence length of 512 tokens, and gradient |
| checkpointing enabled to manage activation memory. [^2] | checkpointing enabled to manage activation memory. [^2] |
| The training configuration is intentionally conservative. The goal is to | The training configuration is intentionally conservative. The goal is to |
| shift the base model toward the vocabulary, formatting patterns, and | shift the base model toward the vocabulary, formatting patterns, and |
| structural conventions that appear in the engineering and apprenticeship | structural conventions that appear in the engineering and apprenticeship |
| corpora — not to retrain the model on a general task. Two epochs over | corpora — not to retrain the model on a general task. Two epochs over |
| hundreds of tuples is sufficient for this narrow shift. | hundreds of tuples is sufficient for this narrow shift. |
| ## The Two Corpus Streams | ## The Two Corpus Streams |
| **Engineering tuples** are SFT (supervised fine-tuning) pairs drawn from | **Engineering tuples** are SFT (supervised fine-tuning) pairs drawn from |
| actual commit diffs, commit messages, and review briefs across all clusters | actual commit diffs, commit messages, and review briefs across all clusters |
| in the workspace. They teach the model the precise technical vocabulary and | in the workspace. They teach the model the precise technical vocabulary and |
| structural patterns used in the engineering workflow: how diffs are | structural patterns used in the engineering workflow: how diffs are |
| described, how review comments are phrased, and how implementation decisions | described, how review comments are phrased, and how implementation decisions |
| are documented. | are documented. |
| **Apprenticeship tuples** are DPO (direct preference optimisation) pairs | **Apprenticeship tuples** are DPO (direct preference optimisation) pairs |
| produced by the apprenticeship routing substrate. Each pair consists of a | produced by the apprenticeship routing substrate. Each pair consists of a |
| shadow response (the model's unguided output) and a verdict response (the | shadow response (the model's unguided output) and a verdict response (the |
| preferred formulation confirmed by the operator). DPO training on these | preferred formulation confirmed by the operator). DPO training on these |
| pairs moves the model toward the preferred response distribution without | pairs moves the model toward the preferred response distribution without |
| requiring explicit labels for every token. [^3] | requiring explicit labels for every token. [^3] |
| ## Adapter Output and Publication | ## Adapter Output and Publication |
| When training completes, the adapter is saved to | When training completes, the adapter is saved to |
| `/data/weights/adapters/<tenant>/<role>/v<N>/` on the Yo-Yo VM. The adapter | `/data/weights/adapters/<tenant>/<role>/v<N>/` on the Yo-Yo VM. The adapter |
| directory contains the LoRA weight files and tokenizer configuration — total | directory contains the LoRA weight files and tokenizer configuration — total |
| size is typically 1 to 3 GB. `lora-training.sh` then signals | size is typically 1 to 3 GB. `lora-training.sh` then signals |
| `adapter-publish.service`, which uploads the adapter directory to the | `adapter-publish.service`, which uploads the adapter directory to the |
| configured GCS bucket. The adapter is subsequently available to the | configured GCS bucket. The adapter is subsequently available to the |
| workspace Doorman for loading as an inference-time weight overlay on the | workspace Doorman for loading as an inference-time weight overlay on the |
| base model. The marker file is renamed to `.completed` when all steps | base model. The marker file is renamed to `.completed` when all steps |
| succeed. | succeed. |
| ## Adapter Training Versus Continued Pre-Training | ## Adapter Training Versus Continued Pre-Training |
| The nightly LoRA process is adapter training. It produces a weight delta — | The nightly LoRA process is adapter training. It produces a weight delta — |
| a few gigabytes of parameters — that the base model loads at inference time. | a few gigabytes of parameters — that the base model loads at inference time. |
| It runs in approximately two hours on a single L4 GPU and operates over | It runs in approximately two hours on a single L4 GPU and operates over |
| hundreds to low thousands of training tuples. The base model itself is not | hundreds to low thousands of training tuples. The base model itself is not |
| modified. | modified. |
| Continued pre-training (CPT) is a distinct operation at a fundamentally | Continued pre-training (CPT) is a distinct operation at a fundamentally |
| different scale. CPT would produce a new base model checkpoint by training | different scale. CPT would produce a new base model checkpoint by training |
| on 50 to 100 billion tokens across 8 to 32 H100-class GPUs for one to four | on 50 to 100 billion tokens across 8 to 32 H100-class GPUs for one to four |
| weeks. The cost per CPT cycle runs to tens of thousands of dollars. CPT is | weeks. The cost per CPT cycle runs to tens of thousands of dollars. CPT is |
| operator-triggered, never automated, and never scheduled as part of the | operator-triggered, never automated, and never scheduled as part of the |
| nightly pipeline. The first-cut CPT target is Q1 2027, contingent on | nightly pipeline. The first-cut CPT target is Q1 2027, contingent on |
| corpus volume and operator decision. Until that decision is made, all | corpus volume and operator decision. Until that decision is made, all |
| nightly training is adapter-only. | nightly training is adapter-only. |
| ## Current Status | ## Current Status |
| The nightly pipeline code is complete. The workspace language model service | The nightly pipeline code is complete. The workspace language model service |
| passes 177 of 177 tests. The Packer image rebuild that bakes the training | passes 177 of 177 tests. The Packer image rebuild that bakes the training |
| Python stack (peft, bitsandbytes, trl) into the Yo-Yo VM is the next | Python stack (peft, bitsandbytes, trl) into the Yo-Yo VM is the next |
| intended operator action. Once that image is deployed, the operator enables | intended operator action. Once that image is deployed, the operator enables |
| `lora-training.service` on the Yo-Yo VM with | `lora-training.service` on the Yo-Yo VM with |
| `systemctl enable --now lora-training.service`. Until the image is rebuilt, | `systemctl enable --now lora-training.service`. Until the image is rebuilt, |
| the training phase runs in marker-only mode: `corpus-threshold.py` writes | the training phase runs in marker-only mode: `corpus-threshold.py` writes |
| and dispatches the GCS marker, but `lora-training.sh` is not yet active | and dispatches the GCS marker, but `lora-training.sh` is not yet active |
| on the runtime VM image. | on the runtime VM image. |
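
The marker claim described under Phase 2 (an atomic rename that appends `.claimed`) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the contents of `lora-training.sh`: the function name `claim_marker`, the directory argument, and the marker file names are hypothetical, and the atomicity relies on POSIX `rename` semantics on a local filesystem.

```python
import os
from pathlib import Path
from typing import Optional


def claim_marker(pending_dir: Path) -> Optional[Path]:
    """Try to claim one pending marker; return the claimed path, or None.

    Hypothetical helper: the real pipeline's file layout may differ.
    """
    for marker in sorted(pending_dir.iterdir()):
        # Skip anything that is already claimed or finished.
        if not marker.is_file() or marker.name.endswith((".claimed", ".completed")):
            continue
        claimed = marker.with_name(marker.name + ".claimed")
        try:
            # On POSIX, rename is atomic: exactly one caller moves the file.
            os.rename(marker, claimed)
        except FileNotFoundError:
            # Another poller renamed this marker first; try the next one.
            continue
        return claimed
    return None
```

Because the rename either succeeds for exactly one caller or fails with a missing-source error, two pollers that see the same marker cannot both start a training run on it.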