Diff: architecture/elastic-compute-lora-training-pipeline
From 1c02ec1 to 1c02ec1
+0 / −0 lines
| Before | After |
|---|---|
| --- | --- |
| schema: foundry-doc-v1 | schema: foundry-doc-v1 |
| title: "Elastic Compute #1 nightly LoRA training pipeline" | title: "Elastic Compute #1 nightly LoRA training pipeline" |
| slug: elastic-compute-lora-training-pipeline | slug: elastic-compute-lora-training-pipeline |
| category: architecture | category: architecture |
| type: concept | type: concept |
| content_type: topic | content_type: topic |
| quality: pre-build | quality: pre-build |
| status: pre-build | status: pre-build |
| audience: vendor-public | audience: vendor-public |
| bcsc_class: current-fact | bcsc_class: current-fact |
| language_protocol: PROSE-TOPIC | language_protocol: PROSE-TOPIC |
| last_edited: 2026-05-25 | last_edited: 2026-05-25 |
| editor: pointsav-engineering | editor: pointsav-engineering |
| paired_with: elastic-compute-lora-training-pipeline.es.md | paired_with: elastic-compute-lora-training-pipeline.es.md |
| short_description: "Elastic Compute #1 runs a nightly two-phase pipeline that rebuilds the deployment DataGraph and produces LoRA adapter weights for the workspace language model. Phase 1 uses a 32B inference model for entity extraction; Phase 2 trains a 7B parameter-efficient adapter from engineering and apprenticeship corpora." | short_description: "Elastic Compute #1 runs a nightly two-phase pipeline that rebuilds the deployment DataGraph and produces LoRA adapter weights for the workspace language model. Phase 1 uses a 32B inference model for entity extraction; Phase 2 trains a 7B parameter-efficient adapter from engineering and apprenticeship corpora." |
| cites: [] | cites: [] |
| --- | --- |
| The [[pointsav-overview|PointSav]] [[compounding-substrate|compounding substrate]] requires periodic retraining to incorporate the operator interactions and editorial decisions accumulated since the previous cycle. Elastic Compute #1 is the compute node that runs this retraining nightly — a GPU-equipped cloud spot instance ([[yoyo-compute-substrate|Yo-Yo compute]]) that rebuilds the knowledge graph and produces updated LoRA (Low-Rank Adaptation) adapter weights for the platform's local language model. The pipeline operationalises the theoretical claim that every productive session improves the platform for the next one: it converts raw interaction data into model weights the next session inherits. | The [[pointsav-overview|PointSav]] [[compounding-substrate|compounding substrate]] requires periodic retraining to incorporate the operator interactions and editorial decisions accumulated since the previous cycle. Elastic Compute #1 is the compute node that runs this retraining nightly — a GPU-equipped cloud spot instance ([[yoyo-compute-substrate|Yo-Yo compute]]) that rebuilds the knowledge graph and produces updated LoRA (Low-Rank Adaptation) adapter weights for the platform's local language model. The pipeline operationalises the theoretical claim that every productive session improves the platform for the next one: it converts raw interaction data into model weights the next session inherits. |
| Elastic Compute #1 is a g2-standard-4 Google Cloud spot instance equipped with a single NVIDIA L4 GPU (24 GB VRAM). Each night it runs a two-phase, four-hour pipeline that produces fine-tuned adapter weights for the workspace language model. Phase 1 extracts structured business entities from the operator data corpus and writes them to a property graph. Phase 2 reads accumulated engineering and apprenticeship training tuples, checks whether the corpus has crossed a minimum threshold, and runs a parameter-efficient training pass against the base model. The two phases are mandatory and sequential — they cannot overlap because both require exclusive access to the L4 GPU. | Elastic Compute #1 is a g2-standard-4 Google Cloud spot instance equipped with a single NVIDIA L4 GPU (24 GB VRAM). Each night it runs a two-phase, four-hour pipeline that produces fine-tuned adapter weights for the workspace language model. Phase 1 extracts structured business entities from the operator data corpus and writes them to a property graph. Phase 2 reads accumulated engineering and apprenticeship training tuples, checks whether the corpus has crossed a minimum threshold, and runs a parameter-efficient training pass against the base model. The two phases are mandatory and sequential — they cannot overlap because both require exclusive access to the L4 GPU. |
| ## Why the phases are separate | ## Why the phases are separate |
| The L4 GPU serves two incompatible workloads within the nightly window. | The L4 GPU serves two incompatible workloads within the nightly window. |
| During Phase 1, vLLM loads OLMo 3 32B Think (4-bit quantised) to run entity | During Phase 1, vLLM loads OLMo 3 32B Think (4-bit quantised) to run entity |
| extraction inference. During Phase 2, the QLoRA training loop loads OLMo 3 | extraction inference. During Phase 2, the QLoRA training loop loads OLMo 3 |
| 7B Think safetensors for gradient computation. The 24 GB L4 cannot host an active | 7B Think safetensors for gradient computation. The 24 GB L4 cannot host an active |
| vLLM inference process and a PyTorch training loop at the same time: the quantised | vLLM inference process and a PyTorch training loop at the same time: the quantised |
| 32B weights and the vLLM KV cache leave too little VRAM for the 7B base model, | 32B weights and the vLLM KV cache leave too little VRAM for the 7B base model, |
| gradients, and optimiser state. `nightly-run.sh` enforces the boundary explicitly: | gradients, and optimiser state. `nightly-run.sh` enforces the boundary explicitly: |
| Phase 1 ends with `stop-yoyo.sh`, which drains the vLLM process and frees | Phase 1 ends with `stop-yoyo.sh`, which drains the vLLM process and frees |
| the GPU before Phase 2 begins. Each phase has a configurable budget that | the GPU before Phase 2 begins. Each phase has a configurable budget that |
| defaults to 7200 seconds (two hours). | defaults to 7200 seconds (two hours). |
| ## Phase 1 — DataGraph rebuild | ## Phase 1 — DataGraph rebuild |
| At the start of the nightly window, `start-yoyo.sh` boots the Elastic Compute #1 VM | At the start of the nightly window, `start-yoyo.sh` boots the Elastic Compute #1 VM |
| and waits up to 90 minutes for vLLM to signal readiness. Once the inference | and waits up to 90 minutes for vLLM to signal readiness. Once the inference |
| server is live, `jennifer-datagraph-rebuild.sh` processes three document | server is live, `jennifer-datagraph-rebuild.sh` processes three document |
| streams from the operator deployment: meeting transcript markdown files, | streams from the operator deployment: meeting transcript markdown files, |
| agent research YAML and markdown files, and contact source JSON records. | agent research YAML and markdown files, and contact source JSON records. |
| For each document, the script calls `POST :9080/v1/chat/completions` through | For each document, the script calls `POST :9080/v1/chat/completions` through |
| the [[doorman-protocol|Doorman]], which routes the payload to the 32B Think model on the Elastic Compute VM. | the [[doorman-protocol|Doorman]], which routes the payload to the 32B Think model on the Elastic Compute VM. |
| The model returns a structured JSON array of named entities — people, | The model returns a structured JSON array of named entities — people, |
| companies, projects, accounts, and locations — constrained by a JSON Schema | companies, projects, accounts, and locations — constrained by a JSON Schema |
| grammar so the output is machine-parseable without post-processing. The | grammar so the output is machine-parseable without post-processing. The |
| script then calls `POST :9081/v1/graph/mutate` on [[service-content]] to write | script then calls `POST :9081/v1/graph/mutate` on [[service-content]] to write |
| those entities into LadybugDB. A local ledger of processed document hashes | those entities into LadybugDB. A local ledger of processed document hashes |
| ensures each document is processed exactly once across multiple nightly runs. | ensures each document is processed exactly once across multiple nightly runs. |
| At the end of Phase 1, vLLM stops and the GPU is released. | At the end of Phase 1, vLLM stops and the GPU is released. |
| ## Phase 2 — Adapter training | ## Phase 2 — Adapter training |
| `corpus-threshold.py` runs at the start of Phase 2. It counts JSONL tuples | `corpus-threshold.py` runs at the start of Phase 2. It counts JSONL tuples |
| in two corpus buckets — `engineering-pointsav` (SFT tuples drawn from | in two corpus buckets — `engineering-pointsav` (SFT tuples drawn from |
| cross-cluster engineering commits) and `apprenticeship-pointsav` (DPO pairs | cross-cluster engineering commits) and `apprenticeship-pointsav` (DPO pairs |
| drawn from the apprenticeship routing substrate). When either bucket reaches | drawn from the apprenticeship routing substrate). When either bucket reaches |
| 50 tuples, the script writes a training-pending marker file and, if the | 50 tuples, the script writes a training-pending marker file and, if the |
| `SLM_YOYO_WEIGHTS_GCS_BUCKET` environment variable is set, syncs the | `SLM_YOYO_WEIGHTS_GCS_BUCKET` environment variable is set, syncs the |
| relevant corpus directory to the configured GCS bucket. | relevant corpus directory to the configured GCS bucket. |
| On the Elastic Compute VM, `lora-training.sh` polls the training-pending directory | On the Elastic Compute VM, `lora-training.sh` polls the training-pending directory |
| every 30 seconds. When a marker appears, it claims the marker with an atomic | every 30 seconds. When a marker appears, it claims the marker with an atomic |
| rename (appending `.claimed`), pulls the corpus from GCS, and runs QLoRA | rename (appending `.claimed`), pulls the corpus from GCS, and runs QLoRA |
| using the peft, bitsandbytes, and trl libraries. | using the peft, bitsandbytes, and trl libraries. |
| ## What QLoRA is | ## What QLoRA is |
| QLoRA (Quantised Low-Rank Adaptation) is a parameter-efficient fine-tuning | QLoRA (Quantised Low-Rank Adaptation) is a parameter-efficient fine-tuning |
| method that loads a base model in 4-bit NF4 quantisation and trains a small | method that loads a base model in 4-bit NF4 quantisation and trains a small |
| set of additional weight matrices — called an adapter — rather than updating | set of additional weight matrices — called an adapter — rather than updating |
| the full model. For a 7B-parameter model like OLMo 3 7B Think, 4-bit | the full model. For a 7B-parameter model like OLMo 3 7B Think, 4-bit |
| quantisation reduces the GPU footprint from roughly 14 GB (in bfloat16) to | quantisation reduces the GPU footprint from roughly 14 GB (in bfloat16) to |
| approximately 6 GB, leaving adequate headroom on the 24 GB L4 for the | approximately 6 GB, leaving adequate headroom on the 24 GB L4 for the |
| training loop itself. The adapter targets seven linear projection layers: | training loop itself. The adapter targets seven linear projection layers: |
| `q_proj`, `v_proj`, `k_proj`, `o_proj`, `gate_proj`, `up_proj`, and | `q_proj`, `v_proj`, `k_proj`, `o_proj`, `gate_proj`, `up_proj`, and |
| `down_proj`. Training runs for two epochs with rank 16 (`r=16`), alpha 32 | `down_proj`. Training runs for two epochs with rank 16 (`r=16`), alpha 32 |
| (`lora_alpha=32`), a maximum sequence length of 512 tokens, and gradient | (`lora_alpha=32`), a maximum sequence length of 512 tokens, and gradient |
| checkpointing enabled to manage activation memory. | checkpointing enabled to manage activation memory. |
| The training configuration is intentionally conservative. The goal is to | The training configuration is intentionally conservative. The goal is to |
| shift the base model toward the vocabulary, formatting patterns, and | shift the base model toward the vocabulary, formatting patterns, and |
| structural conventions that appear in the engineering and apprenticeship | structural conventions that appear in the engineering and apprenticeship |
| corpora — not to retrain the model on a general task. Two epochs over | corpora — not to retrain the model on a general task. Two epochs over |
| hundreds of tuples is sufficient for this narrow shift. | hundreds of tuples is sufficient for this narrow shift. |
| ## The two corpus streams | ## The two corpus streams |
| **Engineering tuples** are SFT (supervised fine-tuning) pairs drawn from | **Engineering tuples** are SFT (supervised fine-tuning) pairs drawn from |
| actual commit diffs, commit messages, and review briefs across all clusters | actual commit diffs, commit messages, and review briefs across all clusters |
| in the workspace. They teach the model the precise technical vocabulary and | in the workspace. They teach the model the precise technical vocabulary and |
| structural patterns used in the engineering workflow: how diffs are described, | structural patterns used in the engineering workflow: how diffs are described, |
| how review comments are phrased, and how implementation decisions are | how review comments are phrased, and how implementation decisions are |
| documented. | documented. |
| **Apprenticeship tuples** are DPO (direct preference optimisation) pairs | **Apprenticeship tuples** are DPO (direct preference optimisation) pairs |
| produced by the [[apprenticeship-substrate|apprenticeship routing substrate]]. Each pair consists of a | produced by the [[apprenticeship-substrate|apprenticeship routing substrate]]. Each pair consists of a |
| shadow response (the model's unguided output) and a verdict response (the | shadow response (the model's unguided output) and a verdict response (the |
| preferred formulation confirmed by the operator). DPO training on these pairs | preferred formulation confirmed by the operator). DPO training on these pairs |
| moves the model toward the preferred response distribution without requiring | moves the model toward the preferred response distribution without requiring |
| explicit labels for every token. | explicit labels for every token. |
| ## Adapter output and publication | ## Adapter output and publication |
| When training completes, the adapter is saved to | When training completes, the adapter is saved to |
| `/data/weights/adapters/<tenant>/<role>/v<N>/` on the Elastic Compute VM. The adapter | `/data/weights/adapters/<tenant>/<role>/v<N>/` on the Elastic Compute VM. The adapter |
| directory contains the LoRA weight files and tokenizer configuration — total | directory contains the LoRA weight files and tokenizer configuration — total |
| size is typically 1 to 3 GB. `lora-training.sh` then signals | size is typically 1 to 3 GB. `lora-training.sh` then signals |
| `adapter-publish.service`, which uploads the adapter directory to the | `adapter-publish.service`, which uploads the adapter directory to the |
| configured GCS bucket. The adapter is subsequently available to the workspace | configured GCS bucket. The adapter is subsequently available to the workspace |
| [[doorman-protocol|Doorman]] for loading as an inference-time weight overlay on the base model via [[adapter-composition|adapter composition]]. | [[doorman-protocol|Doorman]] for loading as an inference-time weight overlay on the base model via [[adapter-composition|adapter composition]]. |
| The marker file is renamed to `.completed` when all steps succeed. | The marker file is renamed to `.completed` when all steps succeed. |
| ## Adapter training versus continued pre-training | ## Adapter training versus continued pre-training |
| The nightly LoRA process is adapter training. It produces a weight delta | The nightly LoRA process is adapter training. It produces a weight delta |
| (a few gigabytes of parameters) that is applied on top of the base model at inference time. | (a few gigabytes of parameters) that is applied on top of the base model at inference time. |
| It runs in approximately two hours on a single L4 GPU and operates over | It runs in approximately two hours on a single L4 GPU and operates over |
| hundreds to low thousands of training tuples. The base model itself is not | hundreds to low thousands of training tuples. The base model itself is not |
| modified. | modified. |
| Continued pre-training (CPT) is a distinct operation at a fundamentally | Continued pre-training (CPT) is a distinct operation at a fundamentally |
| different scale. CPT would produce a new base model checkpoint by training | different scale. CPT would produce a new base model checkpoint by training |
| on 50 to 100 billion tokens across 8 to 32 H100-class GPUs for one to four | on 50 to 100 billion tokens across 8 to 32 H100-class GPUs for one to four |
| weeks. The cost per CPT cycle runs to tens of thousands of dollars. CPT is | weeks. The cost per CPT cycle runs to tens of thousands of dollars. CPT is |
| operator-triggered, never automated, and never scheduled as part of the | operator-triggered, never automated, and never scheduled as part of the |
| nightly pipeline. The first-cut CPT target is Q1 2027, contingent on | nightly pipeline. The first-cut CPT target is Q1 2027, contingent on |
| corpus volume and operator decision. Until that decision is made, all | corpus volume and operator decision. Until that decision is made, all |
| nightly training is adapter-only. | nightly training is adapter-only. |
| ## Current status | ## Current status |
| The nightly pipeline code is complete. The workspace language model service | The nightly pipeline code is complete. The workspace language model service |
| passes 177 of 177 tests. The Packer image rebuild that bakes the training | passes 177 of 177 tests. The Packer image rebuild that bakes the training |
| Python stack (peft, bitsandbytes, trl) into the Elastic Compute VM is the next | Python stack (peft, bitsandbytes, trl) into the Elastic Compute VM is the next |
| intended operator action. Once that image is deployed, `lora-training.service` | intended operator action. Once that image is deployed, `lora-training.service` |
| on the Elastic Compute VM will be enabled with `systemctl enable --now lora-training.service`. | on the Elastic Compute VM will be enabled with `systemctl enable --now lora-training.service`. |
| Until the image is rebuilt, the training phase runs in marker-only mode: | Until the image is rebuilt, the training phase runs in marker-only mode: |
| `corpus-threshold.py` writes and dispatches the GCS marker, but | `corpus-threshold.py` writes and dispatches the GCS marker, but |
| `lora-training.sh` is not yet active on the runtime VM image. | `lora-training.sh` is not yet active on the runtime VM image. |
| ## See also | ## See also |
| - [[service-slm-graph-store-migration]] — the DataGraph rebuild that forms Phase 1 of this pipeline | - [[service-slm-graph-store-migration]] — the DataGraph rebuild that forms Phase 1 of this pipeline |
| - [[service-slm]] — the service that orchestrates both phases | - [[service-slm]] — the service that orchestrates both phases |
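
The Phase 2 hand-off described in the rows above (count JSONL tuples, write a training-pending marker at 50 tuples, claim the marker with an atomic rename) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the contents of `corpus-threshold.py` or `lora-training.sh`: the function names, the `.marker` file name, and the one-tuple-per-line counting rule are all hypothetical, and the GCS sync step is omitted.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional

THRESHOLD = 50  # tuple count stated in the document


def count_tuples(bucket_dir: Path) -> int:
    """Count non-empty lines across all *.jsonl files in one corpus bucket.

    Assumption: one training tuple per JSONL line.
    """
    total = 0
    for path in sorted(bucket_dir.glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            total += sum(1 for line in fh if line.strip())
    return total


def write_marker_if_ready(bucket_dir: Path, pending_dir: Path) -> Optional[Path]:
    """Write a training-pending marker once the bucket reaches THRESHOLD."""
    if count_tuples(bucket_dir) < THRESHOLD:
        return None
    pending_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one marker per corpus bucket.
    marker = pending_dir / f"{bucket_dir.name}.marker"
    marker.write_text(bucket_dir.name, encoding="utf-8")
    return marker


def claim_marker(marker: Path) -> Optional[Path]:
    """Claim a marker by renaming it to <name>.claimed.

    Once the first rename succeeds the source path no longer exists, so a
    later claimant gets FileNotFoundError and backs off.
    """
    claimed = marker.with_name(marker.name + ".claimed")
    try:
        os.rename(marker, claimed)
    except FileNotFoundError:
        return None  # another worker already claimed this marker
    return claimed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bucket = Path(tmp) / "engineering-pointsav"
        bucket.mkdir()
        rows = "".join('{"i": %d}\n' % i for i in range(THRESHOLD))
        (bucket / "tuples.jsonl").write_text(rows, encoding="utf-8")
        marker = write_marker_if_ready(bucket, Path(tmp) / "training-pending")
        print(claim_marker(marker).name)
```

The rename is the only coordination point in this sketch: a worker that loses the race sees the marker as already gone and returns to polling, which matches the 30-second polling loop the document describes.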