
Study and Experiment Configuration

llenergymeasure uses YAML files to define experiments and studies. llem run auto-detects whether a YAML file is an experiment (single run) or a study (sweep or multi-experiment run) by inspecting its top-level keys. Files with a sweep: or experiments: key are loaded as studies; all others are loaded as single experiments.

Single Experiment

The minimal experiment YAML requires only model:

model: gpt2
engine: transformers

A fuller example with sub-configs:

model: gpt2
engine: transformers
dtype: bfloat16
max_input_tokens: 256
max_output_tokens: 256

dataset:
  source: aienergyscore
  n_prompts: 100
  order: interleaved

decoder:
  preset: deterministic   # greedy decoding

transformers:
  batch_size: 4
  attn_implementation: sdpa

warmup:
  enabled: true
  n_warmup: 5

baseline:
  enabled: true
  duration_seconds: 30

energy_sampler: auto
gpu_telemetry: true

A vLLM experiment (requires Docker runner):

model: gpt2
engine: vllm

dataset:
  source: aienergyscore
  n_prompts: 100

runners:
  vllm: docker

vllm:
  engine:
    enforce_eager: false
    gpu_memory_utilization: 0.9
    kv_cache_dtype: auto
  sampling:
    max_tokens: 128

Run either with:

llem run experiment.yaml

Study / Sweep

A study runs multiple experiments with different configurations. There are two ways to define the experiment set:

  • sweep: — defines a grid of parameter values. Supports two entry types:
    • Independent axes (list of scalars) — Cartesian product across all axes.
    • Dependent groups (list of dicts) — alternatives within a group, crossed with other groups and axes. Use for parameters with mutual exclusivity or co-dependencies.
  • experiments: — explicit list of experiment configs. Each entry is merged with any top-level shared fields.

Both can be combined: sweep configs are produced first, then explicit entries are appended.

The study YAML also accepts a base: key pointing to a base experiment config file, and a study_execution: block controlling how many cycles to run and in what order.

Sweep Grammar

Single dimension sweep

# 2 configs: dtype=float16 and dtype=bfloat16
name: dtype-sweep
model: gpt2
engine: transformers

dataset:
  n_prompts: 100

sweep:
  dtype: [float16, bfloat16]

study_execution:
  n_cycles: 3
  experiment_order: shuffle

Run with llem run dtype-sweep.yaml. Produces 2 configs × 3 cycles = 6 runs.


Multi-dimension sweep (Cartesian product)

# 4 configs: fp16+50, fp16+100, bf16+50, bf16+100
name: dtype-n-sweep
model: gpt2
engine: transformers

sweep:
  dtype: [float16, bfloat16]
  dataset.n_prompts: [50, 100]

study_execution:
  n_cycles: 3
  experiment_order: shuffle

Produces 4 configs × 3 cycles = 12 runs.


Engine-scoped sweep (2-segment path)

Use dotted paths (engine.param) to sweep an engine-specific parameter:

# 4 configs: batch_size 1, 2, 4, 8 — all with engine=transformers
name: batch-size-sweep
model: gpt2
engine: transformers

sweep:
  pytorch.batch_size: [1, 2, 4, 8]

study_execution:
  n_cycles: 3
  experiment_order: shuffle

Produces 4 configs × 3 cycles = 12 runs. The pytorch.batch_size path expands to a pytorch: { batch_size: N } section in each generated experiment config.
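As an illustration, the first generated config from this sweep would look roughly like this (sketch; the exact generated layout may differ):

model: gpt2
engine: transformers
pytorch:
  batch_size: 1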


Engine-scoped sweep (3-segment path)

For nested engine configs (e.g. vLLM's engine sub-section):

# 6 configs: 3 block_sizes × 2 kv_cache_dtypes
name: kv-cache-sweep
model: gpt2
engine: vllm

sweep:
  vllm.engine.block_size: [8, 16, 32]
  vllm.engine.kv_cache_dtype: [auto, fp8]

runners:
  vllm: docker

study_execution:
  n_cycles: 3

Produces 6 configs × 3 cycles = 18 runs. The path vllm.engine.block_size expands to vllm: { engine: { block_size: N } }.


Nested config sweep (dotted path)

Use dotted paths for nested config fields like dataset.n_prompts or dataset.source:

name: dataset-size-sweep
model: gpt2
engine: transformers

sweep:
  dataset.n_prompts: [50, 100, 200]

study_execution:
  n_cycles: 3

Produces 3 configs × 3 cycles = 9 runs. The dotted path dataset.n_prompts expands to dataset: { n_prompts: N } in each generated experiment config.

Note: Dotted paths starting with an engine name (e.g. transformers.batch_size, vllm.engine.max_num_seqs) are treated as engine-scoped parameters. All other dotted paths (e.g. dataset.n_prompts, dataset.order) are treated as nested config fields.


Explicit experiments list

Use experiments: when the configurations are not a regular grid:

# 2 explicit configs
name: compare-engines

dataset:
  n_prompts: 50

experiments:
  - model: gpt2
    engine: transformers
    dtype: bfloat16
  - model: gpt2
    engine: vllm
    runners:
      vllm: docker

study_execution:
  n_cycles: 3
  experiment_order: interleave

Each entry is merged with any top-level shared fields (dataset.n_prompts: 50 here).
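For example, after merging, the first entry above resolves to roughly this experiment config (illustrative):

model: gpt2
engine: transformers
dtype: bfloat16
dataset:
  n_prompts: 50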


Base inheritance

Use base: to load a base experiment config file and sweep on top of it:

# base-experiment.yaml is a normal experiment YAML
name: dtype-sweep
base: base-experiment.yaml

sweep:
  dtype: [float32, float16, bfloat16]

study_execution:
  n_cycles: 3

The base file is loaded, study-only keys (sweep, experiments, study_execution, base, name, runners) are stripped, and the remaining fields become the starting point for every generated config. Inline fields in the study YAML override base fields. The base path is resolved relative to the study YAML file's directory.
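A sketch of how this composes, assuming a minimal hypothetical base-experiment.yaml next to the study file:

# base-experiment.yaml (hypothetical)
model: gpt2
engine: transformers
dataset:
  n_prompts: 100

# one of the three generated configs (illustrative)
model: gpt2
engine: transformers
dtype: float16
dataset:
  n_prompts: 100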


Mixed sweep and explicit

Both sweep: and experiments: can appear in the same study. Sweep-generated configs come first, then explicit entries are appended:

# 2 sweep configs + 1 explicit config = 3 total
name: mixed-study
model: gpt2

sweep:
  dtype: [float16, bfloat16]

experiments:
  - model: gpt2
    engine: transformers
    dtype: float32
    transformers:
      load_in_4bit: true

study_execution:
  n_cycles: 3
  experiment_order: shuffle
experiment_order: shuffle

Dependent groups (sweep groups)

Some parameters have constraints: torch_compile_mode only makes sense when torch_compile: true, quantisation sub-params like bnb_4bit_quant_type require load_in_4bit: true, and beam search requires do_sample: false. A plain Cartesian sweep would produce invalid combinations. Dependent groups solve this by bundling constrained parameters into named groups of alternative variants.

Type-based disambiguation: a list of scalars is an independent axis; a list of dicts is a dependent group. Groups are crossed with each other and with independent axes, but entries within a group are alternatives (unioned, not crossed).

# 6 configs: 2 dtype × 3 compilation variants
name: compile-sweep
model: gpt2
engine: transformers

sweep:
  dtype: [float16, bfloat16]   # independent axis (2 values)
  pytorch.compilation:         # dependent group (3 variants)
    - pytorch.torch_compile: false
    - pytorch.torch_compile: true
      pytorch.torch_compile_mode: default
    - pytorch.torch_compile: true
      pytorch.torch_compile_mode: max-autotune

The group name (pytorch.compilation) is an abstract label; it doesn't map to a config field. Keys within each variant dict are fully-qualified dotted paths, routed the same way as independent axis keys.


Baseline variant ({})

Use an empty dict {} as a group entry to include a "no override" baseline:

sweep:
  pytorch.quantization:
    - {}   # baseline: no quantisation
    - pytorch.load_in_8bit: true
    - pytorch.load_in_4bit: true
      pytorch.bnb_4bit_quant_type: nf4

Produces 3 variants: unquantised baseline, 8-bit, and 4-bit.


Mini-grids within group entries

A group entry can contain list-valued fields (list of scalars), which expand as a mini Cartesian product within that entry:

sweep:
  pytorch.caching:
    - {}   # baseline
    - pytorch.use_cache: true
      pytorch.cache_implementation: [static, offloaded_static, sliding_window]

Produces 4 variants: 1 baseline + 3 cache implementations (all with use_cache: true).


Cross-section overrides

Group entries can override fields outside their engine section, such as decoder settings:

sweep:
  pytorch.decoding:
    - {}   # baseline: use shared decoder settings
    - decoder.do_sample: false
      decoder.temperature: 0.0
      pytorch.num_beams: 4
      pytorch.early_stopping: true

YAML anchors for repetition reduction

Use YAML anchors (&name) and merge keys (<<: *name) to avoid repeating shared fields across group variants:

sweep:
  tensorrt.quant_config:
    - {}   # baseline: no quantisation
    - &trt_int8
      tensorrt.quant_config.quant_algo: INT8
    - <<: *trt_int8
      tensorrt.quant_config.kv_cache_quant_algo: INT8
    - tensorrt.quant_config.quant_algo: W4A16_AWQ

Multiple groups crossed

When a sweep has multiple groups, they are crossed with each other (and with independent axes):

# 2 dtype × 2 compilation × 3 quantisation = 12 configs
sweep:
  dtype: [float16, bfloat16]
  pytorch.compilation:
    - pytorch.torch_compile: false
    - pytorch.torch_compile: true
      pytorch.torch_compile_mode: default
  pytorch.quantization:
    - {}
    - pytorch.load_in_8bit: true
    - pytorch.load_in_4bit: true
      pytorch.bnb_4bit_quant_type: nf4

Engine-scoped groups

Groups with an engine-prefixed name (e.g. transformers.compilation, vllm.decoding) only apply to that engine's experiments. Universal groups (no engine prefix) apply to all engines.

Collision rule: A group name must not match an independent axis key. Use abstract names like transformers.compilation (not transformers.torch_compile) to avoid collisions.
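A minimal sketch of the naming rule, reusing the pytorch.* paths from the examples above:

sweep:
  # avoid: group named after a real swept parameter path
  # pytorch.torch_compile:
  #   - pytorch.torch_compile: false
  #   - pytorch.torch_compile: true
  # prefer: an abstract group name that matches no field
  pytorch.compilation:
    - pytorch.torch_compile: false
    - pytorch.torch_compile: true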


Execution Configuration

The study_execution: section controls cycle repetition and ordering:

study_execution:
n_cycles: 3
experiment_order: shuffle # sequential | interleave | shuffle | reverse | latin_square

n_cycles — how many times the full experiment list is repeated. Repeated execution reduces measurement variance.

experiment_order — controls execution order across cycles. For experiments A and B with 3 cycles each:

| Order | Sequence | When to use |
|---|---|---|
| sequential | A, A, A, B, B, B | Thermal isolation between experiments |
| interleave | A, B, A, B, A, B | Reduces temporal bias; fair comparison |
| shuffle | random per-cycle, seeded | Publication-quality; eliminates ordering bias |
| reverse | A, B, B, A, A, B | Detects ordering effects via counterbalancing |
| latin_square | Williams design rows | Balances first-order carryover effects |

shuffle order is seeded from the study design hash, so the same study always shuffles identically — reruns are reproducible. Override with an explicit shuffle_seed:

study_execution:
  experiment_order: shuffle
  shuffle_seed: 123   # null = derived from study_design_hash

Note: shuffle_seed (study-level scheduling) and random_seed (per-experiment inference/dataset RNG) are independent by design. Changing one does not affect the other. See Methodology — Seeding model for details.
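For example, both seeds can be set side by side without interacting (illustrative):

random_seed: 42              # per-experiment inference/dataset RNG
study_execution:
  experiment_order: shuffle
  shuffle_seed: 123          # study-level scheduling only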

CLI effective defaults when running llem run study.yaml (if not set in YAML):

  • n_cycles = 3
  • experiment_order = shuffle

Override with llem run study.yaml --cycles 5 --order interleave.

Robustness Controls

The study_execution: section also supports circuit breaker and timeout fields:

study_execution:
  max_consecutive_failures: 10             # 0 = disabled, 1 = fail-fast
  circuit_breaker_cooldown_seconds: 60.0
  wall_clock_timeout_hours: 24             # null = no limit

max_consecutive_failures - circuit breaker threshold. After N consecutive experiment failures, the study aborts and marks remaining experiments as skipped. The circuit breaker follows a 3-state pattern (closed/open/half-open): after tripping, it pauses for the cooldown period, then runs one probe experiment. If the probe succeeds, normal execution resumes; if it fails, the study aborts.

  • 0 disables the circuit breaker entirely (equivalent to --no-circuit-breaker)
  • 1 aborts on the first failure with no cooldown (equivalent to --fail-fast)
  • Default: 10

circuit_breaker_cooldown_seconds - pause duration before the half-open probe experiment. Allows transient issues (GPU thermal throttling, OOM recovery) to resolve before retrying. Default: 60.0.

wall_clock_timeout_hours - hard time limit for the entire study. When the timeout expires, remaining experiments are marked as skipped and the study status is set to timed_out. The manifest preserves all completed results. Default: null (no limit).

CLI flags --fail-fast, --no-circuit-breaker, and --timeout override these settings.
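For instance, fail-fast behaviour can also be set in the YAML rather than on the command line (sketch; the CLI flag takes precedence if both are given):

study_execution:
  max_consecutive_failures: 1   # same effect as --fail-fast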

GPU Lock Files

Studies acquire advisory file locks on GPU devices to prevent concurrent studies from competing for the same GPUs. Locks are acquired atomically (all-or-none) and released automatically on process exit, including after SIGKILL. Disable with --no-lock if your environment handles GPU scheduling externally.
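For example, when a cluster scheduler already assigns GPUs to jobs exclusively:

llem run study.yaml --no-lock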

Study Resume

Interrupted studies (Ctrl-C, timeout, circuit breaker) can be resumed:

llem run study.yaml --resume # auto-detect most recent
llem run study.yaml --resume-dir PATH # specific study directory

Resume skips completed experiments and re-runs failed, skipped, pending, and interrupted ones. Config drift (changed sweep axes or model) raises a hard error to prevent mixing results from different configurations.


Runner Configuration

The runners: section determines how each engine executes:

runners:
  transformers: local   # run on host
  vllm: docker          # use the default image for the engine
  # or pin an explicit image:
  # vllm: "docker:ghcr.io/custom/vllm:latest"
| Value | Behaviour |
|---|---|
| local | Run directly on the host (all dependencies must be installed) |
| docker | Run in a container using the default image for that engine |
| docker:<image> | Run in a container using the specified image |

When docker is used without an explicit image tag, the image is resolved from the installed package version using the template ghcr.io/henrycgbaker/llenergymeasure/{engine}:v{version}. For example, with llenergymeasure==0.9.0 and engine=vllm, the image ghcr.io/henrycgbaker/llenergymeasure/vllm:v0.9.0 is pulled automatically.

See Docker Setup for image pull behaviour and pre-fetching.



Configuration Reference

Full reference for all ExperimentConfig fields. All fields except model are optional and have sensible defaults.


Top-Level Fields

| Field | Type | Default | Description |
|---|---|---|---|
| model | string | (required) | HuggingFace model ID or local path |
| engine | 'pytorch' \| 'vllm' \| 'tensorrt' | | |
| dataset | DatasetConfig | (see below) | Dataset configuration (nested sub-object) |
| dtype | 'float32' \| 'float16' \| 'bfloat16' | | |
| random_seed | integer | 42 | Per-experiment seed: inference RNG and dataset ordering |
| max_input_tokens | integer \| None | 256 | |
| max_output_tokens | integer \| None | 256 | |
| decoder | DecoderConfig | (see section) | Universal decoder/generation configuration |
| warmup | WarmupConfig | (see section) | Warmup phase configuration |
| baseline | BaselineConfig | (see section) | Baseline power measurement configuration |
| energy_sampler | 'auto' \| 'nvml' \| 'zeus' | auto | |
| gpu_telemetry | boolean | true | Persist GPU power/thermal/memory timeseries to Parquet sidecar. NVML always runs for throttle detection; this controls disk output. |
| pytorch | TransformersConfig \| None | null | |
| vllm | VLLMConfig \| None | null | |
| tensorrt | TensorRTConfig \| None | null | |
| lora | LoRAConfig \| None | null | |
| passthrough_kwargs | dict \| None | null | |
| output_dir | string \| None | null | |

Dataset (dataset:)

The dataset: section configures which prompts to use and how they are loaded.

| Field | Type | Default | Description |
|---|---|---|---|
| source | string | aienergyscore | Dataset source: built-in alias (e.g. aienergyscore) or .jsonl file path |
| n_prompts | integer | 100 | Number of prompts to load |
| order | 'interleaved' \| 'grouped' \| 'shuffled' | | |

Examples:

# Built-in dataset (default)
dataset:
source: aienergyscore
n_prompts: 100

# Custom JSONL file
dataset:
source: ./my-prompts.jsonl
n_prompts: 500
order: shuffled

Decoder / Sampling (decoder:)

| Field | Type | Default | Description |
|---|---|---|---|
| temperature | number | 1.0 | Sampling temperature (0=greedy) |
| do_sample | boolean | true | Enable sampling (ignored if temp=0) |
| top_k | integer | 50 | Top-k sampling (0=disabled) |
| top_p | number | 1.0 | Top-p nucleus sampling (1.0=disabled) |
| repetition_penalty | number | 1.0 | Repetition penalty (1.0=no penalty) |
| min_p | number \| None | null | |
| min_new_tokens | integer \| None | null | |
| preset | 'deterministic' \| 'standard' \| 'creative' | | |

Warmup (warmup:)

Warmup has two modes: fixed (default) runs exactly n_warmup prompts; CV convergence (opt-in via convergence_detection: true) runs until the latency coefficient of variation (CV) drops below a threshold. CV mode replaces n_warmup; they are alternative modes, not additive. After warmup, a thermal_floor_seconds wait lets GPU temperature plateau before measurement.

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | true | Enable warmup phase |
| n_warmup | integer | 5 | Number of warmup prompts in fixed mode (ignored when convergence_detection=true) |
| thermal_floor_seconds | number | 60.0 | Post-warmup thermal stabilisation wait in seconds. Minimum 30s enforced. |
| convergence_detection | boolean | false | Enable CV-based adaptive warmup (replaces fixed n_warmup) |
| cv_threshold | number | 0.05 | CV target for convergence (stop when CV < this value) |
| max_prompts | integer | 20 | Safety cap on warmup prompts in CV mode |
| window_size | integer | 3 | Sliding window size for CV calculation |
| min_prompts | integer | 5 | Minimum prompts before checking convergence |
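A hedged example enabling CV-based warmup, using only fields from the table above:

warmup:
  enabled: true
  convergence_detection: true   # replaces fixed n_warmup
  cv_threshold: 0.05
  max_prompts: 20
  thermal_floor_seconds: 60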

Baseline (baseline:)

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | true | Enable baseline power measurement |
| duration_seconds | number | 30.0 | Baseline measurement duration in seconds (5-120s) |
| strategy | string | "validated" | Caching strategy: validated (cached with periodic spot-check, default), cached (disk-persisted TTL), fresh (measure every experiment) |
| cache_ttl_seconds | number | 7200.0 | How long a cached baseline remains valid before re-measurement, in seconds. Min 60s. Used with cached/validated strategies. |
| validation_interval | integer | 5 | Re-validate baseline every N experiments. Used with validated strategy only. |
| drift_threshold | number | 0.10 | Power drift fraction (0.01-0.50) to trigger re-measurement. Used with validated strategy only. |
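An illustrative baseline block using the validated strategy (fields from the table above):

baseline:
  enabled: true
  duration_seconds: 30
  strategy: validated
  validation_interval: 5   # spot-check every 5 experiments
  drift_threshold: 0.10    # re-measure if baseline power drifts by more than 10%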

Energy Sampler (energy_sampler:)

energy_sampler is a flat top-level field (not a nested section). See Energy Measurement for full details on backends, accuracy, and what the harness resolves internally.

| Value | Description |
|---|---|
| auto (default) | Best available: Zeus > NVML > CodeCarbon |
| nvml | NVML power polling at 100ms intervals |
| zeus | Hardware energy counters (Volta+ GPUs). Most accurate. Install: pip install "llenergymeasure[zeus]" |
| codecarbon | System-level (GPU+CPU+RAM). Install: pip install "llenergymeasure[codecarbon]" |
| null | Disable energy measurement (throughput-only mode) |

GPU Telemetry (gpu_telemetry:)

gpu_telemetry controls whether the NVML power/thermal/memory timeseries is persisted to a Parquet sidecar file alongside the result JSON.

gpu_telemetry: true # default: write timeseries.parquet
gpu_telemetry: false # skip parquet output (saves disk for large studies)

What it controls: Whether timeseries.parquet is written to the output directory.

What it does not control: NVML telemetry is always collected during inference for throttle detection and measurement quality warnings, regardless of this setting. Setting gpu_telemetry: false only suppresses disk output.

The Parquet sidecar contains 1Hz downsampled data with 8 columns: timestamp_s, gpu_index, power_w, temperature_c, memory_used_mb, memory_total_mb, sm_utilisation_pct, throttle_reasons. File sizes are typically < 5KB per minute of inference per GPU.

See Energy Measurement for details on how NVML telemetry relates to energy measurement.

Transformers Engine (pytorch:)

| Field | Type | Default |
|---|---|---|
| batch_size | integer \| None | null |
| attn_implementation | 'sdpa' \| 'flash_attention_2' \| 'flash_attention_3' | |
| torch_compile | boolean \| None | null |
| torch_compile_mode | string \| None | null |
| torch_compile_backend | string \| None | null |
| load_in_4bit | boolean \| None | null |
| load_in_8bit | boolean \| None | null |
| bnb_4bit_compute_dtype | 'float16' \| 'bfloat16' \| 'float32' | |
| bnb_4bit_quant_type | 'nf4' \| 'fp4' \| None | |
| bnb_4bit_use_double_quant | boolean \| None | null |
| use_cache | boolean \| None | null |
| cache_implementation | 'static' \| 'offloaded_static' \| 'sliding_window' | |
| num_beams | integer \| None | null |
| early_stopping | boolean \| None | null |
| length_penalty | number \| None | null |
| no_repeat_ngram_size | integer \| None | null |
| prompt_lookup_num_tokens | integer \| None | null |
| device_map | string \| None | null |
| max_memory | dict \| None | null |
| low_cpu_mem_usage | boolean \| None | null |
| allow_tf32 | boolean \| None | null |
| autocast_enabled | boolean \| None | null |
| autocast_dtype | 'float16' \| 'bfloat16' \| None | |
| tp_plan | string \| None | null |
| tp_size | integer \| None | null |
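An illustrative pytorch: section combining a few fields from the table above (sketch; not every combination is valid together):

pytorch:
  batch_size: 4
  attn_implementation: sdpa
  torch_compile: true
  torch_compile_mode: default
  load_in_4bit: true
  bnb_4bit_quant_type: nf4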

vLLM Engine (vllm.engine:)

| Field | Type | Default |
|---|---|---|
| gpu_memory_utilization | number \| None | null |
| swap_space | number \| None | null |
| cpu_offload_gb | number \| None | null |
| block_size | 8 \| 16 \| 32 | |
| kv_cache_dtype | 'auto' \| 'fp8' \| 'fp8_e5m2' | |
| enforce_eager | boolean \| None | null |
| enable_chunked_prefill | boolean \| None | null |
| max_num_seqs | integer \| None | null |
| max_num_batched_tokens | integer \| None | null |
| max_model_len | integer \| None | null |
| tensor_parallel_size | integer \| None | null |
| pipeline_parallel_size | integer \| None | null |
| enable_prefix_caching | boolean \| None | null |
| quantization | 'awq' \| 'gptq' \| 'fp8' | |
| num_scheduler_steps | integer \| None | null |
| max_seq_len_to_capture | integer \| None | null |
| distributed_executor_backend | 'mp' \| 'ray' \| None | |
| speculative_config | VLLMSpeculativeConfig \| None | null |
| offload_group_size | integer \| None | null |
| offload_num_in_group | integer \| None | null |
| offload_prefetch_step | integer \| None | null |
| offload_params | list[string] \| None | null |
| disable_custom_all_reduce | boolean \| None | null |
| kv_cache_memory_bytes | integer \| None | null |
| compilation_config | dict \| None | null |
| attention | VLLMAttentionConfig \| None | null |

vLLM Sampling (vllm.sampling:)

| Field | Type | Default |
|---|---|---|
| max_tokens | integer \| None | null |
| min_tokens | integer \| None | null |
| presence_penalty | number \| None | null |
| frequency_penalty | number \| None | null |
| ignore_eos | boolean \| None | null |
| n | integer \| None | null |

Beam search parameters:

| Field | Type | Default |
|---|---|---|
| beam_width | integer \| None | null |
| length_penalty | number \| None | null |
| early_stopping | boolean \| None | null |
| max_tokens | integer \| None | null |

vLLM Attention (vllm.engine.attention:)

| Field | Type | Default |
|---|---|---|
| engine | string \| None | null |
| flash_attn_version | integer \| None | null |
| flash_attn_max_num_splits_for_cuda_graph | integer \| None | null |
| use_prefill_decode_attention | boolean \| None | null |
| use_prefill_query_quantization | boolean \| None | null |
| use_cudnn_prefill | boolean \| None | null |
| disable_flashinfer_prefill | boolean \| None | null |
| disable_flashinfer_q_quantization | boolean \| None | null |
| use_trtllm_attention | boolean \| None | null |
| use_trtllm_ragged_deepseek_prefill | boolean \| None | null |

TensorRT-LLM Engine (tensorrt:)

| Field | Type | Default |
|---|---|---|
| max_batch_size | integer \| None | null |
| tensor_parallel_size | integer \| None | null |
| pipeline_parallel_size | integer \| None | null |
| max_input_len | integer \| None | null |
| max_seq_len | integer \| None | null |
| max_num_tokens | integer \| None | null |
| dtype | 'float16' \| 'bfloat16' \| None | |
| fast_build | boolean \| None | null |
| engine | string \| None | null |
| engine_path | string \| None | null |
| quant | TensorRTQuantConfig \| None | null |
| kv_cache | TensorRTKvCacheConfig \| None | null |
| scheduler | TensorRTSchedulerConfig \| None | null |
| sampling | TensorRTSamplingConfig \| None | null |

User Config File

llenergymeasure reads per-user defaults from ~/.config/llenergymeasure/config.yaml (XDG base directory, detected via platformdirs). This file is optional — all settings have sensible defaults.

# ~/.config/llenergymeasure/config.yaml
output:
  results_dir: ./results
  model_cache_dir: ~/.cache/huggingface

runners:
  transformers: local
  vllm: docker       # always use Docker for vLLM
  tensorrt: docker   # TensorRT-LLM requires Docker

measurement:
  datacenter_pue: 1.0
  carbon_intensity_gco2_kwh: 0.233

Run llem config to display the current effective configuration and check which engines are installed. Use llem config --verbose for detailed environment information.