# Troubleshooting
Common issues and solutions for llenergymeasure.
## Common Issues
### No GPU detected
**Symptom:** `llem config` shows no GPU, or measurement fails with a CUDA error.

**Cause:** NVIDIA drivers are not installed, the device is not visible in the current environment, or the system is CPU-only.
**Fix:**
- Run `nvidia-smi` to verify the GPU is visible on the host.
- Run `llem config` to see what the tool detects.
- If `nvidia-smi` works but the engine does not, you may be running outside a container that has CUDA — for vLLM and TensorRT-LLM, use `llem run study.yaml` with Docker runners (see docker-setup.md).
- If `nvidia-smi` fails, install or reinstall the NVIDIA drivers for your OS.
### Engine not available on host
**Symptom:** `llem run -e vllm ...` fails immediately with an import error mentioning the engine package.

**Cause:** Engines have no host install path — they run inside per-engine Docker images. A host import of `transformers`, `vllm`, or `tensorrt_llm` will always fail by design.
**Fix:** Build the engine image and dispatch via the Docker runner. The canonical pattern is in development.md; the short form is:

```bash
VER=$(yq '.library.current_version' engine_versions/transformers.yaml)
docker build -f docker/Dockerfile.transformers \
  --build-arg TRANSFORMERS_VERSION="$VER" \
  -t llenergymeasure:transformers-${VER} .
```
Replace `transformers` with `vllm` or `tensorrt` (and add `--gpus all` for those two) for the other engines. Then run `llem run` with a Docker runner configured for the engine — see docker-setup.md.

Run `llem config` to see the current status of each engine.
### Docker pre-flight failed
**Symptom:** `llem run study.yaml` exits early with a pre-flight error about Docker.
**Cause:** One of the Docker pre-flight checks failed. Pre-flight checks verify:
- Docker CLI is on PATH.
- NVIDIA Container Toolkit binary is on PATH (`nvidia-container-runtime`, `nvidia-ctk`, or `nvidia-container-cli`).
- Host `nvidia-smi` present (warn only — remote Docker daemon is supported).
- GPU is visible inside a container (`docker run --gpus all nvidia-smi`).
- CUDA/driver compatibility (checked from container probe output).
**Fix:** Read the error message — it identifies which check failed.
- Docker not found: install Docker Engine (docs.docker.com/engine/install/).
- NVIDIA Container Toolkit not found: follow docker-setup.md.
- GPU not visible inside container: check that `--gpus all` works with `docker run --gpus all nvidia/cuda:12.0-base-ubuntu22.04 nvidia-smi`.
- CUDA/driver mismatch: update the host NVIDIA driver to be compatible with the container's CUDA version.
Skip pre-flight checks for testing or remote daemon setups:

```bash
llem run study.yaml --skip-preflight
```
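The same checks can be replicated by hand. A minimal sketch using only the standard tools named above (nothing here is llem-specific; the CUDA image tag is the one from the resolution list, and any CUDA base image works):

```bash
command -v docker                                          # Docker CLI on PATH?
command -v nvidia-ctk || command -v nvidia-container-cli   # Container Toolkit binary on PATH?
nvidia-smi                                                 # host driver (warn-only in pre-flight)
docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu22.04 nvidia-smi   # GPU visible inside a container?
```

If the last command prints the usual `nvidia-smi` table, the container-side GPU checks should pass.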
### Out of memory (OOM)
**Symptom:** The run crashes with an OOM error (CUDA out of memory).

**Cause:** The model is too large for the available GPU VRAM at the current configuration.
**Fix** (try in order):
- Use a smaller model.
- Reduce `transformers.batch_size` (default is 1 — already minimal for Transformers).
- Switch to lower dtype: `dtype: float16` or `dtype: bfloat16`.
- Enable BitsAndBytes quantization: `pytorch: { load_in_4bit: true }`.
- For vLLM: reduce `vllm.engine.gpu_memory_utilization` (e.g. 0.7 instead of 0.9).
- For vLLM: reduce `vllm.engine.max_model_len` to cap KV cache allocation.
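As a concrete starting point, a minimal sketch of a memory-trimmed Transformers experiment, reusing the field names from the list above (the exact schema is not verified here; check your config reference):

```yaml
# memory-trimmed Transformers run: lower dtype + 4-bit quantization
engine: pytorch
dtype: float16
pytorch:
  load_in_4bit: true   # BitsAndBytes 4-bit quantization
```

For vLLM, the equivalent levers are `vllm.engine.gpu_memory_utilization` and `vllm.engine.max_model_len`, as listed above.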
### Permission denied (Docker)
**Symptom:** `docker run` fails with permission denied.

**Cause:** The current user is not in the `docker` group.
**Fix:**

```bash
sudo usermod -aG docker $USER
newgrp docker   # activate without logout
```

Or prefix Docker commands with `sudo`. The group membership becomes permanent only after you log out and back in.
### Study failed partially
**Symptom:** A study run produces some results but not all. Some experiments are missing from the output directory.

**Cause:** Individual experiments may fail while the study continues. llenergymeasure uses skip-and-continue: a failed experiment is recorded as an error in the study manifest, and execution continues with the remaining experiments.
**Fix:**
- Check the study manifest in `results/` for per-experiment status and error messages (a scanning sketch follows this list).
- The manifest records the failure reason for each skipped config.
- Fix the failing config (see the error message) and re-run the specific experiment separately before merging results.
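To scan the manifest for failures without opening it by hand, something like the following works, assuming the manifest is JSON. The filename `study_manifest.json` and the `experiments`/`status`/`error` keys are illustrative guesses, not the documented schema; adjust to what you find in `results/`:

```python
# hypothetical manifest scan -- file and key names are guesses, adjust as needed
import json
from pathlib import Path

manifest = json.loads(Path("results/study_manifest.json").read_text())
for exp in manifest.get("experiments", []):
    if exp.get("status") != "completed":      # surface anything skipped or failed
        print(exp.get("name"), "->", exp.get("error"))
```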
### Container crashes with "context canceled" or ValidationError
**Symptom:** Experiments using Docker runners fail immediately. The container log shows `context canceled` or a Pydantic `ValidationError` mentioning unknown fields (e.g. `dtype: Extra inputs are not permitted` or `dataset: Input should be a valid string`).
**Cause:** The Transformers Docker image was built from an older version of the source code. The host sends config JSON using the current schema, but the container rejects it because its bundled code expects the old schema. This only applies to Transformers; the vLLM and TensorRT-LLM containers bind-mount host source at run time, so they always see the current schema.
**Fix:** Rebuild the Transformers image from the current source:

```bash
docker build -f docker/Dockerfile.transformers -t ghcr.io/henrycgbaker/llenergymeasure/transformers:v0.9.0 .
```

Replace `v0.9.0` with your installed version (`llem --version`). See Installation - Getting Engine Images for full instructions.
### Results look wrong / energy is 0
**Symptom:** `inference_energy_joules` is 0.0, or energy values seem too low.

**Cause:** Energy measurement requires NVML (pynvml) and a supported NVIDIA GPU. If NVML is unavailable, energy measurement falls back gracefully to zero rather than crashing.
**Fix:**
- Check `llem config` — it shows the active energy sampler under `Energy:`.
- Verify pynvml can access the GPU: run `python -c "import pynvml; pynvml.nvmlInit(); print('OK')"` (a fuller probe follows below).
- Check your config. Setting `energy_sampler: null` explicitly disables energy measurement (throughput-only mode).
- If `baseline.enabled: true` (default), ensure the baseline measurement is completing. A failed baseline causes `adjusted_j` to be null.
- For very short inference runs (< 200ms), NVML polling at 100ms intervals may not collect enough samples for accurate integration. Use larger `n` values.
Zeus and CodeCarbon are optional extras. If they are not installed, the tool falls back to NVML. See energy-measurement.md for backend details.
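To go one step beyond the import check, the following standalone probe samples power draw for half a second using the public pynvml API (independent of llem; device index 0 assumed). Non-zero readings mean NVML energy sampling has what it needs:

```python
# standalone NVML power probe, using only the public pynvml API
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)    # first GPU
print("GPU:", pynvml.nvmlDeviceGetName(handle))
for _ in range(5):
    mw = pynvml.nvmlDeviceGetPowerUsage(handle)  # instantaneous draw in milliwatts
    print(f"power draw: {mw / 1000:.1f} W")
    time.sleep(0.1)                              # 100 ms, the polling interval cited above
pynvml.nvmlShutdown()
```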
### Warmup takes too long
**Symptom:** Experiments take much longer than expected. Progress stalls for 1-2 minutes before measurement begins.

**Cause:** Warmup is enabled by default. It runs `n_warmup=5` warmup inferences, then waits `thermal_floor_seconds=60.0` seconds for GPU temperature to stabilise before measuring.
**Fix** for quick testing — add this to the experiment YAML:

```yaml
warmup:
  enabled: false
```

Then run it as usual: `llem run experiment.yaml`.
For publication-quality measurements, leave warmup enabled. See methodology.md for why warmup matters.
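If you want shorter runs without disabling warmup outright, the two defaults named above can be reduced instead. A sketch reusing those field names (not verified against the full schema):

```yaml
warmup:
  enabled: true
  n_warmup: 2                  # default: 5
  thermal_floor_seconds: 10.0  # default: 60.0
```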
## Invalid Parameter Combinations
### Config Validation Errors
These combinations are rejected at config load time with a clear error message.
| Engine | Invalid Combination | Reason | Resolution |
|---|---|---|---|
| pytorch | load_in_4bit=True + load_in_8bit=True | Cannot use both 4-bit and 8-bit quantization simultaneously | Choose one: transformers.load_in_4bit: true OR transformers.load_in_8bit: true |
| pytorch | torch_compile_mode without torch_compile=True | torch_compile_mode/torch_compile_backend only take effect when torch_compile=True | Set transformers.torch_compile: true when using torch_compile_mode or torch_compile_backend |
| pytorch | bnb_4bit_* without load_in_4bit=True | BitsAndBytes 4-bit options require 4-bit quantization to be enabled | Set transformers.load_in_4bit: true when using bnb_4bit_compute_dtype, bnb_4bit_quant_type, or bnb_4bit_use_double_quant |
| pytorch | cache_implementation with use_cache=False | Cannot specify a cache strategy when caching is explicitly disabled | Remove use_cache: false or remove cache_implementation |
| all | engine section mismatch | Engine section must match the engine: field | Ensure pytorch: / vllm: / tensorrt: section matches engine: field |
| all | passthrough_kwargs key collision | passthrough_kwargs keys must not collide with ExperimentConfig fields | Use named fields directly instead of passthrough_kwargs |
| tensorrt | dtype: float32 | TensorRT-LLM is optimised for lower precision inference | Use dtype: float16 or dtype: bfloat16 |
| vllm | transformers.load_in_4bit or pytorch.load_in_8bit | vLLM does not support bitsandbytes quantization | Use vllm.engine.quantization (awq, gptq, fp8) for quantized inference |
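For example, a config like the following sketch trips the first row of the table at load time (the `pytorch:` section key follows the form used elsewhere on this page; the overall file shape is assumed):

```yaml
engine: pytorch
pytorch:
  load_in_4bit: true
  load_in_8bit: true   # conflicts with load_in_4bit -> rejected at config load
```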
### Runtime Limitations
These combinations pass config validation but may fail at runtime due to hardware, model, or package requirements.
| Engine | Parameter | Limitation | Resolution |
|---|---|---|---|
| pytorch | transformers.attn_implementation: flash_attention_2 | flash-attn requires Ampere+ GPU; may fail on older architectures | Use attn_implementation: sdpa on pre-Ampere GPUs |
| pytorch | transformers.attn_implementation: flash_attention_3 | FA3 requires the flash_attn_3 package (built from flash-attn hopper/ directory) and Ampere+ GPU (SM80+). Included in the Docker image by default | Install locally from source if not using Docker. See Installation - FA3 |
| vllm | vllm.engine.kv_cache_dtype: fp8 | FP8 KV cache requires Hopper (H100) or newer GPU | Use kv_cache_dtype: auto for automatic selection |
| vllm | vllm.engine.attention.backend: FLASHINFER | FlashInfer requires JIT compilation on first use | Use attention.backend: auto or FLASH_ATTN |
| vllm | vllm.engine.attention.backend: TORCH_SDPA | TORCH_SDPA not registered in vLLM attention backends | Use attention.backend: auto or FLASH_ATTN |
| vllm | vllm.engine.quantization: awq or gptq | Requires a pre-quantized model checkpoint | Use a quantized model (e.g. TheBloke/*-AWQ) or omit |
| tensorrt | tensorrt.quant_config.quant_algo: FP8 | FP8 requires SM >= 8.9 (Ada Lovelace or Hopper). A100 (SM80) raises a ConfigurationError — no silent emulation or fallback | Use INT8, W4A16_AWQ, W4A16_GPTQ, or W8A16 on A100 |
| tensorrt | tensorrt.quantization: int8_sq | INT8 SmoothQuant requires a calibration dataset | Provide calibration config or use a supported quantization method |
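As an illustration of the AWQ/GPTQ rows: vLLM quantization only works against an already-quantized checkpoint. A sketch, where the model ID is one example of the `TheBloke/*-AWQ` pattern above and the `model:` key is assumed rather than taken from the documented schema:

```yaml
engine: vllm
model: TheBloke/Llama-2-7B-AWQ   # pre-quantized checkpoint (illustrative)
vllm:
  engine:
    quantization: awq
```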
## Engine Capability Matrix
| Feature | Transformers | vLLM | TensorRT |
|---|---|---|---|
| Tensor Parallel | Yes (HF native) | Yes | Yes |
| Data Parallel | Yes | No | No |
| BitsAndBytes (4-bit) | Yes | No | No |
| BitsAndBytes (8-bit) | Yes | No | No |
| Native Quantization | No | AWQ / GPTQ / FP8 | INT8 / W4A16 (AWQ/GPTQ) / FP8 |
| float32 precision | Yes | No | No |
| float16 precision | Yes | Yes | Yes |
| bfloat16 precision | Yes | Yes | Yes |
| Prefix Caching | No | Yes | No |
| LoRA Adapters | Yes | No | No |
| torch.compile | Yes | No | No |
| Beam Search | Yes | Yes | No |
| Speculative Decoding | Yes | Yes | No |
| Static KV Cache | Yes | No | No |
Notes:
- Transformers Tensor Parallel uses HF native TP via `tp_plan`/`tp_size` (requires Transformers >= 4.50 and `torchrun` launch).
- vLLM does not support FP32 precision. Use FP16 or BF16.
- vLLM supports 4-bit via AWQ/GPTQ quantized models, not bitsandbytes.
- TensorRT-LLM is optimised for FP16/BF16/INT8 precision, not FP32.
## Docker rebuild is slow / recompiling flash-attn
**Symptom:** `make docker-build-transformers` takes 15-20 minutes and the post-build summary line reports `⚠ no GHCR cache imported (cold build)` (or BuildKit output shows flash-attn source downloads and nvcc compilation for every build).
**Cause:** BuildKit's `cache_from` registry pull was skipped. In rough order of likelihood:

(a) `BUILDX_BUILDER` is unset or pointing at the default `docker` driver. The default driver cannot import registry caches at all — `cache_from` entries are silently ignored. Confirm with `docker buildx ls`: the row marked with `*` (current builder) must show driver `docker-container`, not `docker`. Fix by adding `BUILDX_BUILDER=llem-builder` to your `.env` (it ships in `.env.example`) and re-running `make docker-builder-setup` if the builder doesn't exist yet.

(b) You are on a fresh buildx builder with no local cache (this is normal on the very first build — the first-pull cost is paid once).

(c) You are offline or GHCR is unreachable.

(d) Your `TRANSFORMERS_VERSION` (from `engine_versions/transformers.yaml`) does not match any published cache tag (`cache_from` resolves to `:transformers-<VERSION>` and falls through to `:latest` — if neither has usable layers, BuildKit silently cold-builds).
The full BuildKit log for the most recent attempt is at `/tmp/llem-build-{engine}.log` — grep it for `importing cache manifest` to see whether the registry was even reached.
**Fix:**
- Confirm the builder driver: `docker buildx ls`. The active builder (marked `*`) must be `docker-container`. If it's `docker`, run `make docker-builder-setup` and ensure `BUILDX_BUILDER=llem-builder` is in your `.env` (or exported in the shell).
- Inspect the builder cache: `docker buildx du --builder llem-builder`. If it's near-empty, BuildKit has nothing to reuse locally and will pull from the registry.
- Verify network: `curl -I https://ghcr.io/v2/henrycgbaker/llenergymeasure/transformers/manifests/latest` should return 200 or 401 (both fine; 000/timeout means no connectivity).
- If you recently bumped the SSOT version but CI hasn't published the per-version tag yet, fall back to `:latest` (which the `cache_from` chain already lists as a fallback). No env-var override needed.
- If the cache is corrupt, recreate the builder: `make docker-builder-rm && make docker-builder-setup`. Note this discards all local layer cache; the first subsequent build will repopulate from GHCR.
- Offline is expected-slow. BuildKit degrades gracefully to a cold build — no errors, just minutes.
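The list above condenses into a four-line diagnosis, using only commands already named on this page (log path shown for the Transformers build):

```bash
docker buildx ls                                # active (*) builder must use the docker-container driver
docker buildx du --builder llem-builder         # near-empty => no local cache to reuse
curl -I https://ghcr.io/v2/henrycgbaker/llenergymeasure/transformers/manifests/latest   # 200/401 = reachable
grep -i "importing cache manifest" /tmp/llem-build-transformers.log                     # was the registry consulted?
```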
**CI can't build the Transformers image (FA3 compile OOM / heartbeat loss):** The FA3 Hopper compile requires ~8-16 GB RAM and multiple hours on a 4-core runner. Seed the GHCR cache once from a developer machine with more resources:

```bash
docker login ghcr.io            # needs write:packages scope
make docker-seed-transformers   # builds + pushes cache to ghcr.io (~minutes if locally cached)
```

After seeding, CI warm-rebuilds from the GHCR cache in <5 min.
## Schema skew between host and Docker image
**Symptom:** `llem run study.yaml` aborts before any experiment with a message like:

```
Docker image 'llenergymeasure:transformers' was built from llenergymeasure 0.9.0
(schema 9988776655ff) but the host is running 0.9.0 (schema a1b2c3d4e5f6).
The container will reject ExperimentConfig fields added on the host after
the image was built.
```

Or: a container stack trace full of `extra_forbidden` Pydantic errors (often with URLs mixing `errors.pydantic.dev/2.10/…` and `errors.pydantic.dev/2.12/…`, a tell for version skew).
**Legacy:** the schema-fingerprint handshake no longer catches this. Earlier versions of `llem` stamped each image with an `llem.expconf.schema.fingerprint` label, and `StudyRunner._prepare_images` compared it to the host fingerprint before any experiment ran. Once images stopped baking in the project source (it is bind-mounted at runtime), the handshake became structurally redundant: the in-container source always equals the host source. The label is no longer set on any engine image; `llem doctor` reports `UNVERIFIED` and does not block. The dead `version_handshake.py` plumbing is tracked for removal in a follow-up issue.
**Cause:** All three engines now bind-mount the host project source at runtime, so a Pydantic-shape error here means the engine library inside the image is at a version that no longer matches what the host code expects (e.g. the SSOT bumped `transformers` but the local image is still on an older tag).
**Fix:** Rebuild or repull the affected engine image. One of:

```bash
make docker-build-transformers                           # local build, Transformers
docker pull vllm/vllm-openai:0.7.3                       # repull vLLM upstream
docker pull nvcr.io/nvidia/tensorrt-llm/release:0.21.0   # repull TensorRT-LLM upstream
make docker-pull                                         # pull the newest published Transformers tag
```
Verify the image actually changed by inspecting the digest:

```bash
docker image inspect llenergymeasure:transformers --format '{{.Id}}'
```
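To confirm which engine library version is actually inside an image, a probe along these lines can help. It assumes the image has `python` on PATH and tolerates an overridden entrypoint, which is not guaranteed for every engine image:

```bash
docker run --rm --entrypoint python llenergymeasure:transformers \
    -c "import transformers; print(transformers.__version__)"
```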
## Getting Help
Run `llem config --verbose` to capture full environment details (Python version, installed engines, GPU info, energy sampler status, config file path). Include this output when filing a bug report.
File issues at: github.com/henrycgbaker/llenergymeasure/issues