Results schema

Reference for everything llem writes to disk after a measurement. Three artefacts ship per study: a per-experiment result.json, a study-level manifest.json, and an optional timeseries.parquet sidecar.

For a guided walkthrough of how to read these files (with worked examples), see How to interpret results. For the methodology behind each metric, see What we measure and Energy measurement.

Output layout

A study run produces a directory tree like this:

results/
└── <study-name>_<UTC-timestamp>/
    ├── manifest.json                            # study-level checkpoint + summary
    ├── 001_c0_<model>-<engine>_<hash>/          # one experiment cell
    │   ├── result.json                          # all metrics + resolved config
    │   ├── effective_config.json                # final config used (post-expansion)
    │   └── timeseries.parquet                   # GPU power/thermal/memory samples
    ├── 002_c0_.../
    ├── ...
    └── _study-artefacts/
        ├── equivalence_groups.json              # dedup equivalence groups
        └── baseline_cache_<key>.json            # per-engine baseline cache

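<UTC-timestamp> is an ISO-8601-style UTC timestamp with the time separators written as hyphens so that it is safe to use in a directory name (e.g. 2026-05-07T14-32-08). Cell directories are named <NNN>_c<cycle>_<model>-<engine>_<config-hash>, so they sort by their numeric prefix and sibling cycles are easy to tell apart at a glance.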
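
The naming pattern above can be split back into its parts when post-processing a results tree. The following is a minimal sketch, not part of llem: `parse_cell_name` and `CellName` are hypothetical names, and the regex assumes the engine is one of the three documented values and that the config hash contains no underscore.

```python
import re
from typing import NamedTuple, Optional

# Engines listed in the Identification table.
KNOWN_ENGINES = ("transformers", "vllm", "tensorrt")


class CellName(NamedTuple):
    index: int
    cycle: int
    model: str
    engine: str
    config_hash: str


# <NNN>_c<cycle>_<model>-<engine>_<config-hash>
# The model part is greedy, so hyphens or underscores inside it are kept.
_CELL_RE = re.compile(
    r"^(?P<index>\d+)_c(?P<cycle>\d+)_(?P<model>.+)-(?P<engine>"
    + "|".join(KNOWN_ENGINES)
    + r")_(?P<hash>[^_]+)$"
)


def parse_cell_name(name: str) -> Optional[CellName]:
    """Return the parsed parts, or None if the name is not a cell directory."""
    m = _CELL_RE.match(name)
    if m is None:
        return None
    return CellName(
        int(m["index"]), int(m["cycle"]), m["model"], m["engine"], m["hash"]
    )
```

Directories that do not follow the pattern, such as `_study-artefacts`, yield `None`, which makes the function usable as a filter over `study.iterdir()`.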

`result.json` - per-experiment record

The scientific record. One JSON file per experiment cell. Schema version 4.0.

Identification

Field	Type	Description
`schema_version`	str	Result schema version (currently `"4.0"`)
`experiment_id`	str	Unique experiment identifier (`{model}_{YYYYMMDD_HHMMSS}` for single experiments; study-level cells inherit a richer per-cell identifier)
`measurement_config_hash`	str	SHA-256[:16] of `ExperimentConfig` with environment fields excluded; same hash -> logically identical experiments
`llenergymeasure_version`	str \| null	Package version that produced this result
`engine`	str	Inference engine: `transformers` \| `vllm` \| `tensorrt`
`engine_version`	str \| null	Engine library version (e.g. `4.57.0` for transformers)
`model_name`	str	Model name or HuggingFace path used
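
To illustrate why two logically identical experiments share a `measurement_config_hash`, here is a sketch of a truncated SHA-256 over a canonicalised config. The serialisation (sorted-key JSON) and the field names in the usage note are assumptions for illustration only; the hashes it produces will not match those written by llem.

```python
import hashlib
import json


def short_config_hash(config: dict, excluded: frozenset = frozenset()) -> str:
    """Hash a config dict, ignoring the keys named in `excluded`.

    Assumption: sorted-key compact JSON as the canonical form. The real
    canonicalisation used for ExperimentConfig may differ.
    """
    kept = {k: v for k, v in config.items() if k not in excluded}
    canonical = json.dumps(kept, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

With a hypothetical environment field `hostname` excluded, `{"model": "m", "batch_size": 8, "hostname": "a"}` and the same dict with `"hostname": "b"` hash identically, while changing `batch_size` changes the hash.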

Measurement methodology

Field	Type	Description
`measurement_methodology`	`"total"` \| `"steady_state"` \| `"windowed"`	Which slice of the run produced the headline metrics
`warmup_excluded_samples`	int \| null	Number of warmup iterations run before the measurement window (from `warmup_result.iterations_completed`); `null` when no warmup result is available
`reproducibility_notes`	str	Free-text caveats (the default mentions NVML accuracy of ±5% and thermal drift)

Aggregate metrics

These are the run totals (post-warmup-exclusion when applicable).

Field	Type	Description
`total_tokens`	int	Total output tokens generated across all prompts
`total_energy_j`	float	Total GPU energy in joules (raw, no baseline subtraction)
`total_inference_time_sec`	float	Total wall-clock inference time
`avg_tokens_per_second`	float	Throughput: `total_tokens / total_inference_time_sec`
`avg_energy_per_token_j`	float	Energy per output token in joules
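
The two averaged fields can be recomputed from the totals when sanity-checking a record. This sketch takes the throughput formula from the table; the energy-per-token formula (`total_energy_j / total_tokens`) is an assumption based on the field descriptions, and `derive_rates` is a hypothetical helper.

```python
from typing import Optional, Tuple


def derive_rates(
    total_tokens: int, total_energy_j: float, total_inference_time_sec: float
) -> Tuple[Optional[float], Optional[float]]:
    """Return (tokens per second, joules per token), or None where undefined."""
    tokens_per_second = (
        total_tokens / total_inference_time_sec
        if total_inference_time_sec > 0
        else None
    )
    # Assumption: raw (unadjusted) energy divided by output tokens.
    energy_per_token_j = total_energy_j / total_tokens if total_tokens > 0 else None
    return tokens_per_second, energy_per_token_j
```

For example, 2000 tokens, 500 J and 10 s give 200 tokens per second and 0.25 J per token.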

Per-token energy (millijoules)

Field	Type	Description
`mj_per_tok_total`	float \| null	Millijoules per token from raw (unadjusted) energy
`mj_per_tok_adjusted`	float \| null	Millijoules per token from baseline-adjusted energy. `null` when no baseline was measured. This is the right field for cross-experiment comparisons.

:::note Why adjusted beats total for comparisons
mj_per_tok_adjusted subtracts idle GPU power before dividing by token count. Two experiments running on hardware with different idle power (or at different thermal states) will show a spurious difference in mj_per_tok_total even when inference is identical. See Energy measurement for the full reasoning.
:::
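
A small worked example makes the effect concrete. The numbers below are invented for illustration: two runs do the same 300 J of net inference work over 10 s and 1000 tokens, but one GPU idles at 20 W and the other at 50 W. The helper names are hypothetical.

```python
def mj_per_token(energy_j: float, tokens: int) -> float:
    """Convert joules over a token count into millijoules per token."""
    return 1000.0 * energy_j / tokens


def adjusted_energy_j(
    total_energy_j: float, baseline_power_w: float, duration_s: float
) -> float:
    """Subtract the idle draw accumulated over the run from the raw energy."""
    return total_energy_j - baseline_power_w * duration_s


# Illustrative values only: same net work, different idle power.
tokens, duration_s, net_work_j = 1000, 10.0, 300.0
run_a_total_j = net_work_j + 20.0 * duration_s  # idle 20 W -> 500 J raw
run_b_total_j = net_work_j + 50.0 * duration_s  # idle 50 W -> 800 J raw

raw_a = mj_per_token(run_a_total_j, tokens)
raw_b = mj_per_token(run_b_total_j, tokens)
adj_a = mj_per_token(adjusted_energy_j(run_a_total_j, 20.0, duration_s), tokens)
adj_b = mj_per_token(adjusted_energy_j(run_b_total_j, 50.0, duration_s), tokens)
```

Here the raw figures differ (500 versus 800 mJ per token) while the adjusted figures agree (300 mJ per token for both), which is the spurious difference the note describes.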

FLOPs

total_flops is an estimate (not measurable directly during inference). The derived per-token / per-second fields are null when the divisor is zero.

Field	Type	Description
`total_flops`	float	Total FLOPs estimate for this experiment
`flops_per_output_token`	float \| null	FLOPs per decode token. `null` if `total_flops = 0` or `output_tokens = 0`
`flops_per_input_token`	float \| null	FLOPs per prefill token
`flops_per_second`	float \| null	FLOPs throughput (`total_flops / inference_time_sec`)
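
The null rules in the table can be expressed as guarded divisions. This is a sketch of the documented behaviour rather than the library's code; both function names are hypothetical.

```python
from typing import Optional


def flops_per_output_token(total_flops: float, output_tokens: int) -> Optional[float]:
    """None when either the estimate or the token count is zero."""
    if total_flops == 0 or output_tokens == 0:
        return None
    return total_flops / output_tokens


def flops_per_second(total_flops: float, inference_time_sec: float) -> Optional[float]:
    """None when the divisor is zero."""
    if inference_time_sec == 0:
        return None
    return total_flops / inference_time_sec
```

Readers of `result.json` should therefore treat `null` in these fields as "not computable", not as zero.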

Baseline (idle GPU power)

Field	Type	Description
`baseline_power_w`	float \| null	Idle GPU power in watts, measured before this experiment
`energy_adjusted_j`	float \| null	Total energy minus `baseline_power_w × total_inference_time_sec`. The "net inference work" energy figure.
`energy_per_device_j`	list[float] \| null	Per-GPU energy breakdown (length = `num_processes`)

For the methodology that motivates baseline subtraction, see Methodology > Baseline power.
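
When auditing loaded records, the stored `energy_adjusted_j` can be checked against the formula in the table. The sketch below operates on a plain dict as returned by `json.load`; `adjusted_energy_matches` is a hypothetical helper and the tolerance is an arbitrary choice.

```python
import math
from typing import Optional


def adjusted_energy_matches(result: dict, rel_tol: float = 1e-6) -> Optional[bool]:
    """Compare stored energy_adjusted_j with total minus baseline * time.

    Returns None when no baseline was measured (either field is null).
    """
    baseline = result.get("baseline_power_w")
    stored = result.get("energy_adjusted_j")
    if baseline is None or stored is None:
        return None
    expected = result["total_energy_j"] - baseline * result["total_inference_time_sec"]
    return math.isclose(stored, expected, rel_tol=rel_tol, abs_tol=1e-9)
```

A `None` return distinguishes "no baseline available" from a genuine mismatch.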

Extended efficiency metrics

extended_metrics is a nested object with five always-present sub-objects (memory, gpu_utilisation, batch, kv_cache, request_latency) plus two scalars (tpot_ms, token_efficiency_index). Every leaf is null when it cannot be computed for the engine/run; the harness fills what each engine can provide. latency_stats, steady_state_window, and warmup_excluded_samples live at the top level of result.json.

Field	Type	Description
`extended_metrics.tpot_ms`	float \| null	Time per output token (ITL mean). Populated only when `measurement.latency_profiling=true` (transformers via streamer; vLLM via decode-average ITL); `null` otherwise.
`extended_metrics.token_efficiency_index`	float \| null	Composite `throughput × tokens_per_joule × precision_factor`.
`extended_metrics.memory.model_memory_utilisation`	float \| null	Model weights / total VRAM (0-1).
`extended_metrics.memory.tokens_per_gb_vram`	float \| null	Output tokens per GB of peak VRAM.
`extended_metrics.memory.kv_cache_mb` / `kv_cache_memory_ratio`	float \| null	KV-cache size and its share of peak memory (vLLM only, when exposed).
`extended_metrics.gpu_utilisation.sm_utilisation_mean`	float \| null	Mean SM utilisation (0-100) over NVML samples.
`extended_metrics.gpu_utilisation.memory_bandwidth_utilisation`	float \| null	Mean memory-controller activity (0-100). NVML proxy: percent of time a read/write was issued, not achieved bandwidth.
`extended_metrics.batch.num_batches` / `effective_batch_size` / `batch_utilisation` / `padding_overhead`	int/float \| null	Static-batching efficiency. `null` for vLLM (continuous batching).
`extended_metrics.kv_cache.*`	float/int \| null	Prefix-cache hit rate and block occupancy (vLLM only).
`extended_metrics.request_latency.e2e_latency_{mean,median,p95,p99}_ms`	float \| null	Per-request end-to-end latency distribution.
`latency_stats`	object \| null	TTFT/ITL statistics. vLLM populates TTFT-only stats on every run (engine-recorded first-token timestamps); ITL stats are added only when `measurement.latency_profiling=true`. transformers populates `latency_stats` (TTFT + ITL) only under profiling. Always `null` for tensorrt.
`latency_stats.measurement_mode`	str	Provenance of the latency capture: `true_streaming` (real per-token / first-token timestamps), `proportional` (decode-average ITL estimate, vLLM under profiling), or `per_request_batch`. The mode reflects the weakest signal present.
`steady_state_window`	[float, float] \| null	`(0.0, inference_time_sec)` - the measured window relative to inference start.
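
Because every leaf under `extended_metrics` may be null, reading code benefits from a tolerant accessor. A minimal sketch, with `get_leaf` as a hypothetical helper that walks a dotted path through the loaded JSON:

```python
from typing import Any


def get_leaf(record: dict, dotted_path: str, default: Any = None) -> Any:
    """Follow a dotted path, returning `default` on a missing or null node."""
    node: Any = record
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or node.get(key) is None:
            return default
        node = node[key]
    return node
```

For example, `get_leaf(result, "extended_metrics.memory.tokens_per_gb_vram")` returns the value when the engine populated it and `None` otherwise, without raising on the tensorrt or non-profiled cases.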

Per-engine support matrix

In the table below, "yes" means the engine populates the field in the single-process path; "dash" means it stays null for that engine.

Metric group	vLLM	transformers	tensorrt
`request_latency.*` (per-request E2E)	yes (from RequestOutput metrics)	yes (per-batch approximation)	dash (metrics usually absent in 0.21.0)
`latency_stats` TTFT	yes (always-on)	profiling only	dash
`latency_stats` ITL / `tpot_ms`	profiling only (`proportional`)	profiling only (`true_streaming`)	dash (unsupported)
`kv_cache.*`	yes (best-effort)	dash	dash
`gpu_utilisation.*` (SM + mem-bw)	yes	yes	yes
`memory.*` ratios	yes	yes	yes
`batch.*` (num_batches/padding/utilisation)	dash (continuous batching)	yes	`num_batches=1` only; padding/utilisation dash

Latency profiling is opt-in. Set measurement.latency_profiling: true to capture inter-token latency (and hence tpot_ms). Per-engine semantics:

transformers: a custom generation streamer records true per-token arrival times. Profiling forces batch_size=1 (one streamed token maps to one request) and is incompatible with beam search (num_beams > 1 falls back to the non-profiled path). Mode = true_streaming. With profiling off, latency_stats is null.
vLLM: TTFT comes from engine-recorded first-token timestamps and is populated on every run (mode true_streaming when only TTFT is present). Under profiling, a decode-average ITL is derived per request ((finished - first_token) / (n_out - 1)); because that averages over the decode phase rather than timing each token, the mode becomes proportional.
tensorrt: latency profiling is unsupported; the fields stay null and a warning is recorded in measurement_warnings.
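
The vLLM decode-average ITL described above reduces to a one-line calculation per request. The sketch below assumes timestamps in seconds and returns milliseconds; the function name and the handling of single-token outputs (returning `None`) are assumptions, not documented behaviour.

```python
from typing import Optional


def decode_average_itl_ms(
    first_token_s: float, finished_s: float, n_out: int
) -> Optional[float]:
    """(finished - first_token) / (n_out - 1), expressed in milliseconds.

    With fewer than two output tokens there is no inter-token gap to average.
    """
    if n_out < 2:
        return None
    return 1000.0 * (finished_s - first_token_s) / (n_out - 1)
```

Because this spreads the whole decode phase evenly over the token gaps, it cannot reveal per-token jitter, which is why the mode is reported as proportional.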

Energy caveat. Per-token timing capture adds overhead that can perturb both energy and latency. Energy figures from a profiled run are emitted as-is and are not directly comparable to non-profiled runs; every profiled run records a disclaimer in measurement_warnings (the flag is also part of the config hash, so profiled and non-profiled runs are distinct experiments).

transformers non-profiled latency is approximated. Without profiling, a non-streaming generate() only exposes per-batch wall time, so each prompt in a batch is attributed batch_time / batch_size (the PER_REQUEST_BATCH mode in request_latency). This is an estimate, not a true per-request timestamp.
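
The per-batch attribution amounts to the following, sketched here with a hypothetical helper that takes the batch wall time in seconds and returns one latency in milliseconds per prompt:

```python
from typing import List


def attribute_batch_latency_ms(batch_time_s: float, batch_size: int) -> List[float]:
    """Give every prompt in a batch an equal share of the batch wall time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    per_request_ms = 1000.0 * batch_time_s / batch_size
    return [per_request_ms] * batch_size
```

Every request in a batch receives the same value, so percentiles computed from these figures reflect variation between batches only.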

Sidecar reference

Field	Type	Description
`timeseries`	str \| null	Relative filename of the timeseries sidecar (e.g. `"timeseries.parquet"`); `null` when `output.save_timeseries: false`

Effective config (sibling file)

effective_config.json lives next to result.json in each experiment directory. It contains the fully resolved ExperimentConfig - every parameter value used, including engine defaults that were not explicitly specified. This is what reproduces the experiment.

`manifest.json` - study-level checkpoint

Written when a study starts and updated as it runs; resume support reads from it. Once the study completes, the manifest's summary field has the same shape as the returned StudyResult.summary.

Top-level

Field	Type	Description
`study_name`	str \| null	Study name (used in directory naming)
`study_design_hash`	str \| null	16-char SHA-256 of the resolved experiment list (execution block excluded). Same YAML -> same hash.
`start_time`	datetime	Study start (ISO-8601 UTC)
`end_time`	datetime	Study end (ISO-8601 UTC, populated on completion)
`experiments`	list[dict]	Per-experiment resolved config + status (running \| completed \| failed)
`summary`	`StudySummary`	Aggregate counters (see below)

`summary` block

Field	Type	Description
`total_experiments`	int	Total experiments planned for this study
`completed`	int	Number of successfully completed experiments
`failed`	int	Number of failed experiments
`total_wall_time_s`	float	Total wall-clock time in seconds
`total_energy_j`	float	Total energy across all experiments in joules
`unique_configurations`	int \| null	Distinct experiment configs: `total_experiments / n_cycles`
`warnings`	list[str]	Runtime warnings emitted during the study
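
While a study is still running, the `experiments` list can be tallied to see progress. This sketch assumes each entry exposes its state under a `status` key, which the table implies but does not name; both helpers are hypothetical.

```python
from collections import Counter
from typing import Optional


def tally_statuses(manifest: dict) -> Counter:
    """Count experiments by status (assumed key: "status")."""
    return Counter(
        entry.get("status", "unknown") for entry in manifest.get("experiments", [])
    )


def unique_configurations(total_experiments: int, n_cycles: int) -> Optional[int]:
    """total_experiments / n_cycles, or None when that is not a whole number."""
    if n_cycles <= 0 or total_experiments % n_cycles != 0:
        return None
    return total_experiments // n_cycles
```

Returning `None` for a non-divisible count is a choice made for this sketch; the documented field is simply nullable.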

`timeseries.parquet` - sample-level sidecar

Written when output.save_timeseries: true (the default). One Parquet file per experiment in a columnar layout, suitable for loading directly into pandas, Polars, or DuckDB.

Column	Type	Description
`t`	float64	Wall-clock seconds since experiment start
`gpu_idx`	int32	GPU device index (0, 1, ...) for multi-GPU runs
`power_w`	float64	Instantaneous GPU power draw in watts
`temperature_c`	float64	GPU temperature in °C
`memory_used_mib`	float64	GPU memory used in MiB
`sm_clock_mhz`	float64	SM clock in MHz (when available)

LLenergyMeasure polls NVML at 100 ms intervals; thermal-throttle events shorter than the polling interval may be missed - see Methodology > Known limitations.
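
One common use of the sidecar is to integrate the power samples into an energy estimate per GPU. The sketch below uses the trapezoidal rule over rows carrying the `t`, `gpu_idx` and `power_w` columns (for instance the records of a loaded DataFrame). It is an independent approximation: it is not claimed to be how `total_energy_j` is computed, and `energy_per_gpu_j` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Tuple


def energy_per_gpu_j(rows: Iterable[Mapping[str, float]]) -> Dict[int, float]:
    """Trapezoidal integral of power_w over t, grouped by gpu_idx."""
    by_gpu: Dict[int, List[Tuple[float, float]]] = defaultdict(list)
    for row in rows:
        by_gpu[int(row["gpu_idx"])].append((float(row["t"]), float(row["power_w"])))

    energy: Dict[int, float] = {}
    for gpu, samples in by_gpu.items():
        samples.sort()  # order by time before integrating
        total = 0.0
        for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
            total += 0.5 * (p0 + p1) * (t1 - t0)
        energy[gpu] = total
    return energy
```

With samples roughly 100 ms apart, power spikes shorter than the polling interval are smoothed over in the same way that short throttle events are missed.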

`StudyResult` - final return value (Python API)

Returned by run_study(...). Distinct from manifest.json: this is the fully-assembled object handed back to the caller after the study completes.

Field	Type	Description
`experiments`	list[`ExperimentResult`]	One entry per experiment cell (same fields as the per-experiment `result.json`)
`study_name`	str \| null	Same as manifest
`study_design_hash`	str \| null	Same as manifest
`measurement_protocol`	dict	Flat snapshot of `ExecutionConfig`: `n_cycles`, `experiment_order`, `experiment_gap_seconds`, `cycle_gap_seconds`, `shuffle_seed`, `experiment_timeout_seconds`
`result_files`	list[str]	Paths to per-experiment `result.json` files (paths, not embedded payload)
`summary`	`StudySummary`	Same shape as in the manifest
`skipped_experiments`	list[dict]	Grid points skipped due to validation errors. Each entry: `{raw_config, reason, errors}`

Loading from disk

```python
import json
from pathlib import Path

import pandas as pd  # read_parquet needs a Parquet engine such as pyarrow

study = Path("results/tutorial-multi-engine_2026-05-07T14-32-08")

# Load the study manifest
with (study / "manifest.json").open() as f:
    manifest = json.load(f)

# Load every experiment result, in cell-directory order
results = []
for path in sorted(study.glob("*/result.json")):
    with path.open() as f:
        results.append(json.load(f))

# Load one cell's timeseries sidecar (the directory name here is illustrative)
ts = pd.read_parquet(study / "001_c0_qwen-transformers_a1b2c3" / "timeseries.parquet")
```

For the Python API equivalent (StudyResult object), see Reference > Library API.

Schema versioning

The schema_version field in result.json follows semantic versioning: minor bumps add fields without breaking existing readers; major bumps signal breaking changes. Pre-1.0 the policy is conservative: new fields land as Optional with default = null so existing parsers don't break.
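
A reader can apply this policy by checking only the major component before parsing a record. A minimal sketch, assuming the `"<major>.<minor>"` string form shown in the Identification table; the function names and the default `supported_major=4` are choices made for this example.

```python
from typing import Tuple


def parse_schema_version(version: str) -> Tuple[int, int]:
    """Split a "<major>.<minor>" string such as "4.0" into integers."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def is_compatible(version: str, supported_major: int = 4) -> bool:
    """Accept any minor version within the supported major version."""
    return parse_schema_version(version)[0] == supported_major
```

Under the stated policy, a reader written against 4.0 should accept 4.1 and ignore unknown fields, but refuse 5.0.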
