Diff: architecture/spot-vm-lifecycle-kill-switch

From 3f1e0da to 3f1e0da

+0 / −0 lines
---
schema: foundry-doc-v1
title: "Spot VM Lifecycle — Single Controller and Kill Switch Pattern"
slug: spot-vm-lifecycle-kill-switch
category: architecture
type: topic
status: stable
bcsc_class: no-disclosure-implication
last_edited: 2026-06-11
editor: pointsav-engineering
paired_with: spot-vm-lifecycle-kill-switch.es.md
---

When an automated pipeline depends on a preemptible or spot VM, the lifecycle of that VM
must be owned by a single controller. If two independent timers each hold the authority
to start the VM, sooner or later one of them will start it when nothing is scheduled to
stop it, leaving the VM running between cycles at full cost with no automated stop path.
This document describes the single-controller architecture used for the Yo-Yo batch node
and the sentinel file kill switch that gives the operator immediate control.

## The two-timer problem

The Yo-Yo batch pipeline initially had two timers operating independently:

- `local-yoyo-daily.timer`: ran the daily enrichment cycle, which started and stopped the VM
- `local-corpus-threshold.timer`: checked the training corpus and started the VM if the threshold was exceeded

Both timers called `gcloud compute instances start`. Only the daily cycle timer called
`gcloud compute instances stop`. When `local-corpus-threshold.timer` fired, it could start
the VM but had no path to stop it. If the daily cycle timer did not fire shortly afterward,
the VM would remain running indefinitely.

At the Yo-Yo node's cost of approximately $0.71 per hour, an uncapped start event from
the threshold timer would cost approximately $0.85 (about 1.2 hours of runtime) before the
next daily cycle fired to stop it, assuming the cycle fired at all. If the cycle was skipped
due to a holiday or an active kill switch, the VM could run for 24 hours or more at a cost
of approximately $17.
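
The exposure figures above are simple rate-times-duration arithmetic. A minimal sketch for rerunning them with a different rate or window follows; the function names are illustrative and only the $0.71 hourly rate comes from this document.

```python
HOURLY_RATE_USD = 0.71  # approximate Yo-Yo node rate quoted in this document


def runtime_cost(hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Cost in USD of leaving the VM running for `hours`."""
    return round(hours * rate, 2)


def hours_for_cost(cost: float, rate: float = HOURLY_RATE_USD) -> float:
    """Runtime in hours that a given spend corresponds to."""
    return round(cost / rate, 2)


print(runtime_cost(24))      # 17.04, the "approximately $17" skipped-cycle case
print(hours_for_cost(0.85))  # 1.2, the gap implied by the $0.85 figure
```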

## The single-controller fix

The fix is architectural: exactly one systemd unit owns the full lifecycle of each VM.
`local-corpus-threshold.timer` was masked (symlinked to `/dev/null`), removing its
ability to start the VM. All VM lifecycle operations (start, enrich, check threshold,
optionally train, stop, verify) are now performed within a single invocation of
`yoyo-daily-cycle.sh`, triggered by `local-yoyo-daily.timer`.

The corpus threshold check is now Phase 5 inside the daily cycle rather than a separate
timer. The training trigger is Phase 6. Both run while the VM is already running for
enrichment, so they add no VM start cost.

The rule generalises: for any spot VM that performs multiple automated tasks, consolidate
all tasks into a single orchestrator script invoked by a single timer. Do not give
multiple timers start authority over the same VM.
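
The shape of a single-controller cycle can be sketched as follows. This is an illustration and not the contents of `yoyo-daily-cycle.sh`: `run_cycle` and the stand-in callables are hypothetical, and the real phases would issue `gcloud` commands instead of appending to a list.

```python
from typing import Callable, Iterable, Tuple

Task = Tuple[str, Callable[[], None]]


def run_cycle(
    kill_switch_active: Callable[[], bool],
    start_vm: Callable[[], None],
    tasks: Iterable[Task],
    stop_vm: Callable[[], None],
) -> bool:
    """Run one full cycle; return False if the kill switch suppressed it."""
    if kill_switch_active():        # Phase 0: checked before anything starts
        return False
    start_vm()                      # the only start call in the pipeline
    try:
        for _name, task in tasks:   # enrich, check threshold, optionally train
            task()
    finally:
        stop_vm()                   # runs even if a task raises
    return True


# Hypothetical wiring with stand-in callables that only record what happened.
events = []
ran = run_cycle(
    kill_switch_active=lambda: False,
    start_vm=lambda: events.append("start"),
    tasks=[
        ("enrich", lambda: events.append("enrich")),
        ("threshold", lambda: events.append("threshold")),
    ],
    stop_vm=lambda: events.append("stop"),
)
print(ran, events)  # True ['start', 'enrich', 'threshold', 'stop']
```

The `finally` clause covers a task that raises, but it cannot cover a process that is killed outright, so a backstop outside the orchestrator is still needed.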

## The sentinel file kill switch

A kill switch is a file whose presence or absence controls whether an automated process
runs. The pattern is:

```
presence of /path/to/flag-file → suppress the operation
absence of /path/to/flag-file → normal operation
```

For the Yo-Yo batch node, the kill switch file is `/srv/foundry/data/yoyo-disabled`.
The daily cycle script checks for this file as its first action (Phase 0), before issuing
any `gcloud` commands:

```bash
# Phase 0: bail out before any gcloud call if the sentinel file exists
if [[ -e "$KILL_SWITCH" ]]; then
    log "KILL SWITCH ACTIVE — $KILL_SWITCH present; aborting all VM lifecycle"
    exit 0
fi
```

Creating the file is a one-command action that takes effect on the next timer firing:

```bash
touch /srv/foundry/data/yoyo-disabled
```

Removing the file resumes normal operation:

```bash
rm /srv/foundry/data/yoyo-disabled
```

The pattern is appropriate for any automated process where:

- The operator needs an instant brake that survives a reboot
- The suppression should be persistent across multiple timer firings until explicitly reversed
- No service restart or configuration change should be required to activate or deactivate control

An environment variable (`export SUPPRESS=true`) would not survive a reboot or a service
restart. A systemd unit mask requires root and a `daemon-reload`. The sentinel file
approach is reversible, auditable (its presence or absence is visible with `ls`), and
requires no elevated privileges to activate, provided the operator can write to the
directory that holds the file.
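
For tooling that needs to engage or release the switch from Python, a minimal sketch follows. The helper names are hypothetical, and writing a timestamped reason into the file is an assumption added here to aid auditing; the pattern itself depends only on whether the file exists.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def engage(path: Path, reason: str) -> None:
    """Create the sentinel file; the note inside is optional audit context."""
    stamp = datetime.now(timezone.utc).isoformat()
    path.write_text(f"{stamp} {reason}\n")


def release(path: Path) -> None:
    """Remove the sentinel file; a no-op if it is already absent."""
    path.unlink(missing_ok=True)


def is_engaged(path: Path) -> bool:
    """Presence of the file is the only signal that matters."""
    return path.exists()


# Demonstration against a temporary directory rather than /srv/foundry/data.
with tempfile.TemporaryDirectory() as tmp:
    flag = Path(tmp) / "yoyo-disabled"
    engage(flag, "pausing batch node for maintenance")
    print(is_engaged(flag))  # True
    release(flag)
    print(is_engaged(flag))  # False
```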

## Defense in depth: the idle monitor

The kill switch prevents starts. A separate safety layer stops a VM that is running when
it should not be. The idle monitor timer (`yoyo-idle-monitor.timer`) fires every five
minutes and checks whether the Yo-Yo batch VM has been running for more than 30 minutes
without an active inference request. If that condition is met, the monitor issues a stop
command.

The idle monitor is a backstop, not the primary controller. Its role is to bound the cost
exposure if the daily cycle fails to complete its stop sequence: for example, if the
workspace VM loses connectivity during Phase 8, or if the cycle is interrupted by a
process signal before the stop command is issued.
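
The monitor's stop decision can be sketched as a pure function. The interpretation below is an assumption: it takes "running for more than 30 minutes without an active inference request" to mean 30 minutes since the later of VM start and the last request. The function and parameter names are hypothetical, and how the real monitor obtains these timestamps is not described in this document.

```python
from datetime import datetime, timedelta
from typing import Optional

IDLE_LIMIT = timedelta(minutes=30)  # threshold quoted in this document


def should_stop(
    now: datetime,
    started_at: Optional[datetime],
    last_request_at: Optional[datetime],
    limit: timedelta = IDLE_LIMIT,
) -> bool:
    """Return True when a running VM has been idle for longer than `limit`."""
    if started_at is None:
        return False  # VM is not running, so there is nothing to stop
    idle_since = started_at
    if last_request_at is not None and last_request_at > started_at:
        idle_since = last_request_at  # a request resets the idle clock
    return now - idle_since > limit


start = datetime(2026, 1, 1, 12, 0)
print(should_stop(start + timedelta(minutes=31), start, None))  # True
print(should_stop(start + timedelta(minutes=45), start,
                  start + timedelta(minutes=20)))               # False
```

Because the timer fires every five minutes, an idle VM would be stopped at most about five minutes after it crosses the threshold.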

The combination of single-controller daily cycle, sentinel file kill switch, and idle
monitor provides three independent layers:

1. The daily cycle stops the VM as its final phase (intended path)
2. The idle monitor stops the VM if the cycle fails (first backstop)
3. The kill switch prevents the VM from starting if the operator needs to pause all
   activity (operator override at Phase 0)

## The corpus-threshold.py guard

`corpus-threshold.py` contains a `_start_trainer_vm()` function that was originally called
by the corpus threshold timer. After the timer was masked, this function was modified to
check the kill switch file before issuing any `gcloud compute instances start` command.
This is a defense-in-depth measure: if the function is ever called from a code path that
bypasses the daily cycle, the kill switch still takes effect.

The guard pattern:

```python
# First statement inside _start_trainer_vm(), before any gcloud call
if os.path.exists(KILL_SWITCH_PATH):
    print(f"[kill switch] {KILL_SWITCH_PATH} present — VM start suppressed")
    return
```

Any script that has the authority to start a spot VM should implement this check.
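
A fuller sketch of a guarded start function is shown below. It is not the actual `_start_trainer_vm()` from `corpus-threshold.py`: the function name, the injectable `run` parameter, and the VM name and zone in the usage comment are assumptions for illustration. Only the kill switch path and the guard itself come from this document.

```python
import os
import subprocess
from typing import Callable

KILL_SWITCH_PATH = "/srv/foundry/data/yoyo-disabled"  # path used in this document


def start_vm_guarded(
    name: str,
    zone: str,
    kill_switch_path: str = KILL_SWITCH_PATH,
    run: Callable[..., object] = subprocess.run,
) -> bool:
    """Start the VM unless the kill switch file exists; return True if started."""
    if os.path.exists(kill_switch_path):
        print(f"[kill switch] {kill_switch_path} present; VM start suppressed")
        return False
    # check=True makes a failed gcloud call raise instead of passing silently
    run(["gcloud", "compute", "instances", "start", name, "--zone", zone], check=True)
    return True


# Hypothetical usage; the VM name and zone are placeholders:
# start_vm_guarded("yoyo-batch", "us-central1-a")
```

Passing `run` as a parameter lets the guard be exercised in tests without invoking `gcloud`.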

## Applying the pattern

To apply single-controller + kill switch to any spot VM pipeline:

1. Identify all timers and scripts that call `gcloud compute instances start` for the VM.
2. Consolidate all work into a single orchestrator script. The script starts the VM,
   performs all tasks in sequence, and stops the VM as its final step.
3. Disable all other start paths (mask the timers; modify any scripts that had start
   authority so that they check the kill switch file first).
4. Create the kill switch file path in a directory that survives reboots
   (e.g. `/srv/foundry/data/` or `/var/lib/`).
5. Add the kill switch check as the first statement in the orchestrator script.
6. Add an idle monitor as a cost backstop, targeting the specific VM name and zone.
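
Step 1 can be partly automated by scanning scripts and unit files for start calls. The sketch below is a rough audit aid and not part of the documented pipeline: the function name, the default file suffixes, and the regular expression are assumptions. It matches both shell invocations and argument lists split across quoted strings on one line, but it would miss commands built dynamically or split across lines.

```python
import re
from pathlib import Path
from typing import Iterable, List, Tuple

# Matches "gcloud ... instances start" and '"gcloud", ..., "instances", "start"'
START_CALL = re.compile(r"\bgcloud\b.*\binstances\W+start\b")


def find_start_callers(
    root: Path,
    suffixes: Iterable[str] = (".sh", ".py", ".service"),
) -> List[Tuple[str, int, str]]:
    """Return (file name, line number, line) for each apparent VM start call."""
    wanted = tuple(suffixes)
    hits: List[Tuple[str, int, str]] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in wanted:
            continue
        text = path.read_text(errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if START_CALL.search(line):
                hits.append((path.name, lineno, line.strip()))
    return hits
```

Calling `find_start_callers(Path("/path/to/scripts"))` on the directory that holds the pipeline's scripts gives a starting list; every hit outside the orchestrator is a start path to disable or guard.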