Diff: substrate/trajectory-substrate

From af1a599 to af1a599

+0 / −0 lines

Before	After
---	---
schema: foundry-doc-v1	schema: foundry-doc-v1
title: "The trajectory substrate"	title: "The trajectory substrate"
slug: trajectory-substrate	slug: trajectory-substrate
category: substrate	category: substrate
type: topic	type: topic
quality: complete	quality: complete
short_description: "The platform mechanism that converts operational work — commits, sessions, operator feedback — into structured JSONL training tuples, routing them into a continued-pretraining corpus that improves the OLMo base model over time."	short_description: "The platform mechanism that converts operational work — commits, sessions, operator feedback — into structured JSONL training tuples, routing them into a continued-pretraining corpus that improves the OLMo base model over time."
status: active	status: active
bcsc_class: public-disclosure-safe	bcsc_class: public-disclosure-safe
last_edited: 2026-05-15	last_edited: 2026-05-15
editor: pointsav-engineering	editor: pointsav-engineering
cites:	cites:
- ni-51-102	- ni-51-102
- osc-sn-51-721	- osc-sn-51-721
- constitutional-ai-2212-08073	- constitutional-ai-2212-08073
- federated-lora-2502-05087	- federated-lora-2502-05087
- s-lora-2024	- s-lora-2024
- lorax-predibase	- lorax-predibase
- olmo3-allenai	- olmo3-allenai
paired_with: trajectory-substrate.es.md	paired_with: trajectory-substrate.es.md
---	---

Every commit to the platform's code repositories, every editorial session, every operator correction that marks a suggestion wrong — these interactions are not discarded. They are captured as structured JSONL tuples, tagged with provenance metadata, and routed into a training corpus whose accumulated signal improves the OLMo base model each time a continued-pretraining run closes.	Every commit to the platform's code repositories, every editorial session, every operator correction that marks a suggestion wrong — these interactions are not discarded. They are captured as structured JSONL tuples, tagged with provenance metadata, and routed into a training corpus whose accumulated signal improves the OLMo base model each time a continued-pretraining run closes.

Three orthogonal corpus types determine the architecture. The constitutional corpus captures what the platform's governance charter says a session of each role may and may not do — universal, loaded by every platform deployment. The engineering corpus captures contributor session trajectories and is vendor-scoped. The tenant-runtime corpus captures what flows through Ring 1 inside each customer deployment and never leaves that deployment unless the customer explicitly opts into the federated adapter marketplace (a planned forward-looking feature).	Three orthogonal corpus types determine the architecture. The constitutional corpus captures what the platform's governance charter says a session of each role may and may not do — universal, loaded by every platform deployment. The engineering corpus captures contributor session trajectories and is vendor-scoped. The tenant-runtime corpus captures what flows through Ring 1 inside each customer deployment and never leaves that deployment unless the customer explicitly opts into the federated adapter marketplace (a planned forward-looking feature).

Capture is automatic — no operator decision is required to generate a training tuple. Every JSONL record carries provenance fields (`tuple_type`, `doctrine_version`, `tenant`, `role`, `scope`, `redaction_class`) that let the training pipeline assemble each corpus without trusting prose. Vendor data never co-mingles with customer data at training time; tenant data never crosses tenants — the separation is directory-level and pipeline-level, not policy-level.	Capture is automatic — no operator decision is required to generate a training tuple. Every JSONL record carries provenance fields (`tuple_type`, `doctrine_version`, `tenant`, `role`, `scope`, `redaction_class`) that let the training pipeline assemble each corpus without trusting prose. Vendor data never co-mingles with customer data at training time; tenant data never crosses tenants — the separation is directory-level and pipeline-level, not policy-level.
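A minimal sketch of what one such record might look like, assuming only the six provenance field names listed above. The `payload` key, the helper name `make_record`, and every example value are hypothetical and not taken from the platform.

```python
import json

# Field names come from the text; nothing else here is confirmed by the source.
PROVENANCE_FIELDS = (
    "tuple_type",
    "doctrine_version",
    "tenant",
    "role",
    "scope",
    "redaction_class",
)


def make_record(provenance: dict, payload: dict) -> str:
    """Serialize one training tuple as a single JSONL line.

    Refuses to emit a record that lacks any provenance field, so that
    downstream filtering never has to fall back on reading prose.
    """
    missing = [f for f in PROVENANCE_FIELDS if f not in provenance]
    if missing:
        raise ValueError(f"missing provenance fields: {missing}")
    record = {f: provenance[f] for f in PROVENANCE_FIELDS}
    record["payload"] = payload  # hypothetical key for the captured content
    return json.dumps(record, sort_keys=True)


# Illustrative values only; the real vocabularies are not given in the text.
example_line = make_record(
    {
        "tuple_type": "commit",
        "doctrine_version": "v1",
        "tenant": "vendor",
        "role": "engineer",
        "scope": "cluster-a",
        "redaction_class": "public",
    },
    {"message": "fix typo"},
)
```

Rejecting incomplete records at write time is one way to keep the "without trusting prose" property: a record either carries all six fields or never reaches the corpus.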

An operator's specific correction shapes their own cluster's adapter exclusively — no other tenant benefits or is burdened by it. This per-contributor inverse of aggregate preference learning is structurally inaccessible to platforms whose preference signal is averaged across all users. Per `[ni-51-102]` and `[osc-sn-51-721]`, the continued-pretraining pipeline is described in planned terms; the capture infrastructure and corpus accumulation are operational today.	An operator's specific correction shapes their own cluster's adapter exclusively — no other tenant benefits or is burdened by it. This per-contributor inverse of aggregate preference learning is structurally inaccessible to platforms whose preference signal is averaged across all users. Per `[ni-51-102]` and `[osc-sn-51-721]`, the continued-pretraining pipeline is described in planned terms; the capture infrastructure and corpus accumulation are operational today.

## Overview	## Overview

The Trajectory Substrate converts operational work into continued-pretraining signal without interrupting that work. It operates in the background of every commit, every session, and every operator feedback exchange.	The Trajectory Substrate converts operational work into continued-pretraining signal without interrupting that work. It operates in the background of every commit, every session, and every operator feedback exchange.

Three properties distinguish a trajectory substrate from a generic fine-tuning pipeline:	Three properties distinguish a trajectory substrate from a generic fine-tuning pipeline:

1. Capture is automatic. No operator decision is required to generate a training tuple. The `git post-commit` hook fires; the session-end script fires; the rejected-suggestion script fires. Work produces signal by existing.	1. Capture is automatic. No operator decision is required to generate a training tuple. The `git post-commit` hook fires; the session-end script fires; the rejected-suggestion script fires. Work produces signal by existing.
2. Provenance is structural. Every JSONL record carries the governance version it was produced under, the tenant it belongs to, the role of the session, the cluster, and its redaction class. The training pipeline does not trust prose; it filters on these fields.	2. Provenance is structural. Every JSONL record carries the governance version it was produced under, the tenant it belongs to, the role of the session, the cluster, and its redaction class. The training pipeline does not trust prose; it filters on these fields.
3. Corpus boundaries are enforced by infrastructure. Vendor data never co-mingles with customer data at training time. Tenant data never crosses tenants. The separation is directory-level and pipeline-level — not policy-level.	3. Corpus boundaries are enforced by infrastructure. Vendor data never co-mingles with customer data at training time. Tenant data never crosses tenants. The separation is directory-level and pipeline-level — not policy-level.
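Properties 2 and 3 can be illustrated with a short sketch, under stated assumptions: the pipeline selects records purely by provenance fields, and each tenant's runtime corpus lives in its own directory. The corpus names follow the text; the directory layout, the `root` default, and the helper names `select` and `corpus_dir` are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional


def select(records: list, **required) -> list:
    """Keep only records whose provenance fields equal every required value."""
    return [
        r for r in records
        if all(r.get(field) == value for field, value in required.items())
    ]


def corpus_dir(corpus: str, tenant: Optional[str] = None,
               root: str = "corpus") -> PurePosixPath:
    """Map a corpus (and tenant, where relevant) to a directory.

    Hypothetical layout: tenant-runtime data gets one directory per tenant,
    so two tenants can never share a path.
    """
    if corpus == "tenant-runtime":
        if not tenant:
            raise ValueError("tenant-runtime corpus requires a tenant")
        return PurePosixPath(root) / corpus / tenant
    return PurePosixPath(root) / corpus


# Illustrative records; tenant names are invented.
records = [
    {"tenant": "acme", "role": "editor", "redaction_class": "internal"},
    {"tenant": "globex", "role": "editor", "redaction_class": "internal"},
]
acme_only = select(records, tenant="acme")
acme_path = corpus_dir("tenant-runtime", "acme")
```

Because the tenant name is part of the path itself, separation in this sketch does not depend on any reader of the data honoring a policy, which is the distinction the third property draws.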