
Pipeline Architecture

This doc is the chain-diagram reference for the engine-coupling pipeline. SSOT-driven Renovate cycles flow through these stages: trigger, two per-concern CI workflows, validated artefacts, and the human curation checkpoint before merge.

Asymmetric engine architecture (locked design choice)

The three engines run different pipelines in CI for a load-bearing reason. Don't undo this asymmetry without re-reading #518; the conclusion has held across re-litigations on 2026-04-30, 2026-05-01, and 2026-05-05.

| Engine | Image source | CI flow on PR |
| --- | --- | --- |
| transformers | First-party docker/Dockerfile.transformers (FA3-included; no upstream provides this) | engine-pipeline :: build-transformers (rebuild) → engine-pipeline :: invariants-transformers + schemas-transformers (probe + mine/introspect) → [merge] → publish-engine-image (mirror to production tag) |
| vllm | Upstream vllm/vllm-openai:v<VER> directly + bind-mount llem source | engine-pipeline :: invariants-others + schemas-others matrix cells fire on pull_request: paths (no first-party build) |
| tensorrt | Upstream nvcr.io/nvidia/tensorrt-llm/release:<VER> directly + bind-mount llem source | Same shape as vllm |

Why asymmetric. vllm + tensorrt's upstream images empirically contain everything llem needs at runtime (PoC verified 2026-04-30: pydantic, typer, pyarrow, rich, dotenv, pyyaml all present transitively). Transformers' upstream images don't include FA3, which is non-negotiable for production-equivalent CI runs. So transformers gets a first-party Dockerfile; the others stay upstream-direct.

Drift safety. The only argument for first-party-everywhere is "what if upstream drops a transitive dep llem needs?" That risk is cheap to absorb: the migration from upstream-direct → first-party is bounded (~1 day, well-defined recipe per #518). Running first-party-everywhere, by contrast, costs an FA3 build for two extra engines that don't need it.
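A minimal sketch of how the "upstream dropped a transitive dep" case could be detected early, run inside the upstream container. The module list comes from the PoC above; mapping dotenv and pyyaml to the import names `dotenv` and `yaml` is an assumption, and `missing_modules` is a hypothetical helper, not an existing llem script.

```python
from importlib.util import find_spec

# Import names for the packages the PoC found in the upstream images.
# Assumption: "dotenv" and "pyyaml" are imported as `dotenv` and `yaml`.
REQUIRED_MODULES = ("pydantic", "typer", "pyarrow", "rich", "dotenv", "yaml")


def missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level import names that this interpreter cannot resolve."""
    return [name for name in modules if find_spec(name) is None]
```

A non-empty return value from `missing_modules()` inside the vllm or tensorrt image would be the signal to start the upstream-direct → first-party migration for that engine.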

Transformers PR-time CI flow (rebuild + probe/mine/introspect chain)

PR opens (touches transformers paths: SSOT, Dockerfile, miner code, etc.)

│ engine-pipeline.yml fires (paths trigger:
│ engine_versions/transformers.yaml, docker/Dockerfile.transformers,
│ .github/workflows/engine-pipeline.yml)

[Build transformers runtime image; cache hits ~10-15 min, cold FA3 ~60-90 min]
[Push to ghcr.io/<repo>/transformers-cache:transformers-<VER>]

│ Orchestrator's needs-graph fires the next stage on build success

engine-pipeline.yml :: invariants-transformers + schemas-transformers cells run:
pull transformers-cache image → probe → mine/introspect → validate → upload writeback artefact

│ Probe-fail → CI red. The 'accept-probe-fail' PR label bypasses
│ the gate for known-drift cases (admin escalation; see #547).

[CI green/red. PR ready to merge when green.]

│ PR merges to main (push event with SSOT/Dockerfile change)

publish-engine-image.yml fires DIRECTLY on push (no rebuild):
Tag-copy via `docker buildx imagetools create`:
transformers-cache:transformers-<VER> → transformers:transformers-<VER>
→ transformers:latest
Registry-side metadata op only — seconds, no build infra.
Production image is bit-identical to the cache image validated
by CI on the PR that just merged.
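The tag-copy step above can be sketched as a small command builder. This is an illustration only: `tag_copy_command` and `promotion_command` are hypothetical helpers, and the `registry` argument stands in for the `ghcr.io/<repo>` prefix, which the doc leaves unspecified. The only real interface used is `docker buildx imagetools create --tag <target> <source>`.

```python
def tag_copy_command(source_ref, target_refs):
    """Build the argv for a registry-side tag copy (no image layers are rebuilt)."""
    if not target_refs:
        raise ValueError("at least one target tag is required")
    argv = ["docker", "buildx", "imagetools", "create"]
    for ref in target_refs:
        argv += ["--tag", ref]
    # The source reference goes last; imagetools points the new tags at its manifest.
    argv.append(source_ref)
    return argv


def promotion_command(registry, version):
    """Mirror the CI-validated cache image to the versioned and latest production tags."""
    source = f"{registry}/transformers-cache:transformers-{version}"
    targets = [
        f"{registry}/transformers:transformers-{version}",
        f"{registry}/transformers:latest",
    ]
    return tag_copy_command(source, targets)
```

Because the command only rewrites registry metadata, the production tags resolve to the same digest as the cache image that CI validated on the PR.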

vllm + tensorrt PR-time CI flow (no rebuild; upstream-direct)

The diagram below applies to vllm + tensorrt only — engine-pipeline.yml's invariants-others + schemas-others matrix cells fire on pull_request: paths (no build-transformers dependency). They pull the upstream image at the SSOT-pinned version, bind-mount llem source, and probe/mine/introspect inside the upstream container.
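As a rough sketch of the upstream-direct shape described above, the probe invocation for one matrix cell might be assembled like this. The mount point `/workspace`, the `--entrypoint python3` override, and the `probe_command` helper are all assumptions for illustration; GPU and credential flags are omitted, and the real workflow may differ.

```python
# Image references as given in the engine table; {version} is the SSOT-pinned version.
UPSTREAM_IMAGES = {
    "vllm": "vllm/vllm-openai:v{version}",
    "tensorrt": "nvcr.io/nvidia/tensorrt-llm/release:{version}",
}
PRODUCERS = ("invariants", "schemas")


def probe_command(engine, version, producer, source_dir, mount_point="/workspace"):
    """Build a `docker run` argv that probes inside the upstream container."""
    if engine not in UPSTREAM_IMAGES:
        raise ValueError(f"no upstream-direct image for engine {engine!r}")
    if producer not in PRODUCERS:
        raise ValueError(f"unknown producer {producer!r}")
    image = UPSTREAM_IMAGES[engine].format(version=version)
    return [
        "docker", "run", "--rm",
        # Assumption: override any image entrypoint so the probe module runs instead.
        "--entrypoint", "python3",
        # Bind-mount the llem checkout so no first-party image build is needed.
        "--volume", f"{source_dir}:{mount_point}",
        "--workdir", mount_point,
        image,
        "-m", "scripts._probe", "--producer", producer,
    ]
```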

================================================================================
LLenergyMeasure Engine-Coupling Pipeline (vllm + tensorrt)
Per-concern workflows (engine-invariants + engine-schemas) with sibling
coordination via wait-on-check-action.
================================================================================

LEGEND: [auto] fully automated, no human action
[chk] HUMAN CHECKPOINT — required dev input
[info] informational artefact, advisory
{ } input
[→...] automated transition

{Renovate scans upstream library releases on configured schedule}
│ [auto]

Custom regex manager bumps:
engine_versions/{engine}.yaml:library.current_version (SSOT — canonical)
docker/Dockerfile.{engine} ARG (derived; auto-templated from SSOT)
│ [auto]

{Renovate opens PR: "fix(deps): bump vllm to 0.10.2"}
│ [auto] path-filtered triggers fan out
▼ in PARALLEL to two workflows
┌────────────────────────────┴─────────────────────────────┐
▼ ▼
┌──────────────────────────────┐ ┌──────────────────────────────┐
│ engine-pipeline.yml│ │ engine-pipeline.yml │
│ (per-engine matrix) │ │ (engines matrix) │
│ Layers over: invariant- │ │ Layers over: parameter- │
│ miner + invalidity-miner + │ │ discovery + typed-schema- │
│ lift modules + validation-CI │ │ discovery │
│ │ │ │
│ STEP 1 [auto]: PROBE (inline│ │ STEP 1 [auto]: PROBE (inline│
│ `python -m scripts._probe │ │ `python -m scripts._probe │
│ --producer invariants`) │ │ --producer schemas`) │
│ verdict: pass | fail │ │ verdict: pass | fail │
│ │ │ │
│ ── if probe == pass ── │ │ ── if probe == pass ── │
│ STEP 2 [auto]: MINE │ │ STEP 2 [auto]: DISCOVER │
│ build_corpus.py │ │ engine_introspectors │
│ → src/llenergymeasure/engines│ │ → src/llenergymeasure/ │
│ /{engine}/invariants.proposed.yaml │ │ config/discovered_ │
│ │ │ schemas/{engine}/schema.discovered.json │
│ STEP 3 [auto]: VENDOR-REPLAY│ │ │
│ validate_invariants.py + the │ │ STEP 3 [auto]: DIFF vs HEAD │
│ compare_expected_vs_ │ │ │
│ observed contract from │ │ STEP 4 [auto]: REGENERATE │
│ _invariant_validation_common.py│ │ docs/generated/ │
│ replays kwargs_positive + │ │ curation-{engine}.md │
│ kwargs_negative against │ │ (Parameters section — │
│ live library; classifies │ │ fact base for human │
│ outcomes (positive_ │ │ curator; pre-existing │
│ confirmed, negative_ │ │ behaviour preserved) │
│ confirmed, divergence) │ │ │
│ → src/llenergymeasure/engines│ │ STEP 5 [auto]: COMMENT │
│ /{engine}/invariants.validated.yaml │ │ + LABEL (suppress on empty)│
│ │ │ │
│ STEP 4 [auto]: DIFF vs HEAD │ │ ── if probe == fail ── │
│ for both proposed.yaml + │ │ Post probe-fail comment │
│ validated.yaml artefacts │ │ with 3 routes (per §3 of │
│ │ │ the design doc: patch │
│ STEP 5 [auto]: REGENERATE │ │ code / /approve-reuse / │
│ docs/generated/ │ │ escalate). Apply │
│ invariants-{engine}.md │ │ probe-blocked label. │
│ (Invariants section — fact │ │ exit 0 (not CI failure) │
│ base; encompasses dormancy │ │ │
│ + invalidity + miner │ │ │
│ output + introspection + │ │ │
│ runtime catch-all) │ │ │
│ │ │ │
│ STEP 6 [auto]: COMMENT │ │ │
│ + LABEL (suppress on empty)│ │ │
│ │ │ │
│ ── if probe == fail ── │ │ │
│ Same 3-route handling as │ │ │
│ the schemas pipeline. │ │ │
│ Apply probe-blocked label. │ │ │
│ exit 0. │ │ │
└─────────────┬────────────────┘ └────────────┬─────────────────┘
│ │
│ Each workflow: │
│ - uploads engine-step-diff-{engine}-{concern}.yaml │
│ - posts its OWN per-pipeline comment (suppress on empty) │
│ - applies its own per-pipeline label │
│ (invariants/schemas-changed, invariants/schemas-breaking, │
│ corpus-changed, probe-blocked) │
│ - WAITS for sibling pipeline to complete │
│ (lewagon/wait-on-check-action; already-finished sibling │
│ exits immediately) │
│ - LAST-FINISHING workflow performs ATOMIC WRITEBACK in-line: │
│ git add src/llenergymeasure/engines/{engine}/invariants.proposed.yaml │
│ src/llenergymeasure/engines/{engine}/invariants.validated.yaml │
│ src/llenergymeasure/config/discovered_schemas/ │
│ {engine}/schema.discovered.json │
│ docs/generated/curation-{engine}.md │
│ docs/generated/invariants-{engine}.md │
│ engine_versions/{engine}.compat.json │
│ engine_versions/{engine}.yaml (if /approve-reuse │
│ fired during cycle) │
│ git commit && git push --force-with-lease │
│ - LAST-FINISHING workflow applies cross-pipeline rollup label │
│ (safe-bump | probe-blocked) │
│ │
│ NO summariser workflow file. NO composite action. │
│ Cross-pipeline state lives on labels (GitHub-native primitive). │
│ "Did the cycle run?" = check-status badge. "Anything change?" │
│ = per-pipeline comments + commits. "What's the rollup state?" │
│ = label. │

┌────────────────────────────────────────────────┐
│ PR after a Renovate cycle: │
│ - 2 per-concern check statuses │
│ - up to 2 comments per cycle (suppress-on-empty):│
│ 1. engine-invariants pipeline │
│ 2. engine-schemas pipeline │
│ - 1 atomic bot commit (all artefacts; written │
│ by whichever workflow finished last) │
│ - cross-pipeline rollup label │
│ (safe-bump | probe-blocked) │
└─────────────────────┬──────────────────────────┘


╔═══════════════════════════════════════════════════════════════════╗
║ HUMAN CURATION CHECKPOINT [chk] ║
║ The only crossing of the human-as-final-checkpoint boundary (P6)║
║ inside the otherwise-automated validated half. Bots NEVER edit ║
║ src/llenergymeasure/config/engine_configs.py. ║
║ ║
║ Dev consumes auto-generated digests: ║
║ docs/generated/curation-{engine}.md ║
║ Section 1: Parameters (discovered fields × Pydantic-curated ║
║ yes/no, deltas vs previous SSOT version) ║
║ docs/generated/invariants-{engine}.md ║
║ Section 1: Invariants (corpus rules added/changed/removed, ║
║ classified by added_by; encompasses dormancy + ║
║ invalidity + miner output + introspection + ║
║ runtime catch-all) ║
║ ║
║ Dev manually edits engine_configs.py: ║
║ - which discovered params to expose in Pydantic ║
║ - which Literal narrowings to pin ║
║ - which sub-config taxonomy to use ║
║ - which custom @model_validator decorators to add ║
║ ║
║ Push -> triggers re-run of CI cycle -> updated summary comment ║
║ supersedes prior (edited via comment-id key, no proliferation) ║
║ ║
║ Decision routes after digest review: ║
║ safe-bump + green CI -> squash-merge ║
║ corpus-changed + mechanical -> squash-merge ║
║ invariants-breaking -> edit engine_configs.py ║
║ schemas-breaking -> edit engine_configs.py ║
║ probe-blocked -> resolve via §3 routes: ║
║ - patch producer code, OR ║
║ - /approve-reuse (slash cmd), OR ║
║ - escalate label ║
║ ║
║ GUIDED CURATION UX (RFC-style YAML decision file + libcst ║
║ applier) is DEFERRED to issue #475. Current redesign ships ║
║ self-serve curation only — devs hand-edit engine_configs.py ║
║ based on digest. After 2-3 Renovate cycles of operational data, ║
║ #475 reactivation will evaluate whether guided UX pays off. ║
╚═══════════════════════════════════════════════════════════════════╝


┌──────────────┐
│ squash-merge │
└──────┬───────┘

PR closes; engine version + all
validated artefacts + curated Pydantic
pinned together at this commit.
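The rollup label and the decision routes in the flow above can be summarised in a few lines. This is a sketch under stated assumptions: the rule "probe-blocked if either concern's probe failed, else safe-bump" is inferred from the two rollup values, and `rollup_label` and `NEXT_ACTION` are hypothetical names rather than code from the workflows.

```python
CONCERNS = ("invariants", "schemas")


def rollup_label(probe_verdicts):
    """Derive the cross-pipeline rollup label from the two per-concern probe verdicts."""
    if set(probe_verdicts) != set(CONCERNS):
        raise ValueError(f"expected verdicts for exactly {CONCERNS}")
    for concern, verdict in probe_verdicts.items():
        if verdict not in ("pass", "fail"):
            raise ValueError(f"{concern}: verdict must be 'pass' or 'fail'")
    # Assumption: a single failed probe blocks the whole cycle.
    return "probe-blocked" if "fail" in probe_verdicts.values() else "safe-bump"


# Decision routes from the human curation checkpoint, keyed by label.
NEXT_ACTION = {
    "safe-bump": "squash-merge once CI is green",
    "corpus-changed": "squash-merge if the change is mechanical",
    "invariants-breaking": "edit engine_configs.py",
    "schemas-breaking": "edit engine_configs.py",
    "probe-blocked": "resolve via the probe-fail routes",
}
```

For example, `rollup_label({"invariants": "pass", "schemas": "fail"})` yields `probe-blocked`, which `NEXT_ACTION` maps to the probe-fail routes rather than a merge.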

================================================================================
PROBE-FAIL HUMAN CHECKPOINT [chk]
The OTHER human touchpoint (per P6). Inside the otherwise-automated CI half.
================================================================================

When a probe fails (inline step 1 of either workflow), three resolution routes:

┌─ ROUTE 1 [chk → auto]: Patch producer code
│ Dev edits scripts/engine_miners/{engine}_*_miner.py or
│ scripts/engine_introspectors/{engine}_introspector.py to fix the
│ broken landmark (e.g. follow an upstream rename). Push commit ->
│ workflow re-runs -> probe re-runs -> if pass, downstream stages
│ proceed.

├─ ROUTE 2 [chk → auto]: Approve reuse via slash command
│ Dev posts `@llem-ci-bot /approve-reuse <engine> <producer>` as
│ PR comment. Producer ∈ {invariants, schemas} (per-producer
│ granularity — vllm invariants might be reusable while vllm
│ schemas are not).
│ │ [auto]
│ ▼
│ approve-reuse-bot.yml (issue_comment: created listener)
│ - Validates dev approval rights
│ - Updates engine_versions/{engine}.yaml miner_pins.{producer}
│ to widen SpecifierSet to include the bumped version
│ - Commits SSOT change via llem-ci-bot App token (cascades;
│ GITHUB_TOKEN would not)
│ │ [auto]
│ ▼
│ Probe re-runs against widened range -> verdict flips to PASS
│ -> downstream stages proceed

└─ ROUTE 3 [chk]: Escalate / block
Dev applies probe-blocked label. Renovate stops retrying this
bump until the label is removed; route 1 or 2 must follow before
merge.

NO OTHER SLASH COMMANDS. /rerun, /skip-probe, /force-merge explicitly
rejected as footguns. Deliberate scope: one binary approval gate per
(engine, producer), no escape hatches.
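A minimal sketch of how the single slash command might be recognised in a PR comment, assuming the whole comment body is the command. `parse_approve_reuse` is a hypothetical helper, and the accepted engine list (the three engines named in this doc) is an assumption; the real listener in approve-reuse-bot.yml also validates approval rights, which is not shown here.

```python
import re

ENGINES = ("transformers", "vllm", "tensorrt")  # assumption: all three are accepted
PRODUCERS = ("invariants", "schemas")

# Only /approve-reuse is recognised; /rerun, /skip-probe, /force-merge never match.
_COMMAND = re.compile(r"^@llem-ci-bot\s+/approve-reuse\s+(\S+)\s+(\S+)$")


def parse_approve_reuse(comment_body):
    """Return (engine, producer) for a valid approval comment, else None."""
    match = _COMMAND.match(comment_body.strip())
    if match is None:
        return None
    engine, producer = match.groups()
    if engine not in ENGINES or producer not in PRODUCERS:
        return None
    return engine, producer
```

Returning `None` for anything else keeps the gate binary: a comment either approves reuse for exactly one (engine, producer) pair or does nothing.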

================================================================================
ADJACENT PIPELINES (independent of per-PR Renovate cycle)
================================================================================

engine-versions-sweep.yml {scheduled, e.g. weekly} [auto] [info]
└─ runs scripts/_probe.py over a curated version range
(e.g. vllm v0.9..v0.12); updates engine_versions/{engine}.compat.json
(probe cache + compat-matrix in one file; closes #470).
Populates probe-result cache so per-PR probes hit warm cache.

Runtime side-product {study runtime, NOT CI; study-local}
├─ runtime_observations.jsonl [info]
│ - Producer: src/llenergymeasure/study/runtime_observations.py
│ (warnings.catch_warnings + logger handler wrapping each worker
│ body); wired in runner.py
│ - Schema: schema_version=1; one record per (study_run_id,
│ config_hash, cycle); outcome ∈ {success, exception,
│ subprocess_died}
│ - Consumer (today): llem report-gaps (--source runtime-warnings,
│ the only wired source). Output: YAML fragment for manual
│ append to corpus (`# TODO: human` markers on placeholder
│ fields). PRESERVED as escape-hatch.
│ - Consumer (long-term): subsume into curation digest Section 3
│ ("Runtime gaps observed"). DEFERRED #475; reactivate after
│ 2-3 Renovate cycles of operational data.

└─ equivalence_groups.json [info]
- Detects observed_config_hash collisions across configs:
configs Pydantic distinguishes (resolved_config_hash differs)
but engine collapses (observed_config_hash matches). Flagged
as gap_detected: true -> dormancy signal.
- proposed_invariant_id field is currently always None; consumer
deferred until a researcher hits a real gap_detected: true
group and asks for tooling. Tracked in #405 + #474.
================================================================================
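The collision check behind equivalence_groups.json, as described above, can be sketched as a group-by on the observed hash. The input record shape and the `resolved_config_hashes` output field are assumptions for illustration; only `observed_config_hash`, `resolved_config_hash`, `gap_detected`, and `proposed_invariant_id` are named in this doc.

```python
from collections import defaultdict


def equivalence_groups(records):
    """Group configs by observed hash and flag groups the engine collapses."""
    resolved_by_observed = defaultdict(set)
    for record in records:
        resolved_by_observed[record["observed_config_hash"]].add(
            record["resolved_config_hash"]
        )
    groups = []
    for observed, resolved in sorted(resolved_by_observed.items()):
        groups.append({
            "observed_config_hash": observed,
            "resolved_config_hashes": sorted(resolved),  # hypothetical field name
            # Pydantic sees distinct configs, the engine sees one: dormancy signal.
            "gap_detected": len(resolved) > 1,
            # No consumer exists yet, so this stays unset.
            "proposed_invariant_id": None,
        })
    return groups
```

Two configs that differ only in a parameter the engine silently ignores would share an observed hash while keeping distinct resolved hashes, so their group comes back with `gap_detected` set to true.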

For the full design rationale (including the resolution of the per-engine vs per-concern split, the wait-for-sibling coordination decision, and the rejected summariser-workflow alternative), see the engine-coupling design discussion captured across PRs #477–#492.