Troubleshooting

Common issues and solutions for llenergymeasure.


Common Issues

No GPU detected

Symptom: llem config shows no GPU, or measurement fails with a CUDA error.

Cause: NVIDIA drivers are not installed, the device is not visible in the current environment, or the system is CPU-only.

Fix:

  1. Run nvidia-smi to verify the GPU is visible on the host.
  2. Run llem config to see what the tool detects.
  3. If nvidia-smi works on the host but the engine does not see the GPU, you may be running outside a CUDA-enabled container. For vLLM and TensorRT-LLM, use llem run study.yaml with Docker runners (see docker-setup.md). A combined check sequence is sketched after this list.
  4. If nvidia-smi fails, install or reinstall the NVIDIA drivers for your OS.
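
Putting the first three steps together, a quick check sequence (the CUDA image tag is the same one used in the pre-flight section below; any CUDA base image works):

nvidia-smi                                   # GPU visible on the host?
llem config                                  # what does the tool itself detect?
docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu22.04 nvidia-smi   # GPU visible inside a container?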

Engine not available on host

Symptom: llem run -e vllm ... fails immediately with an import error mentioning the engine package.

Cause: Engines have no host install path — they run inside per-engine Docker images. A host import of transformers, vllm, or tensorrt_llm will always fail by design.

Fix: Build the engine image and dispatch via the Docker runner. The canonical pattern is in development.md; the short form is:

VER=$(yq '.library.current_version' engine_versions/transformers.yaml)
docker build -f docker/Dockerfile.transformers \
--build-arg TRANSFORMERS_VERSION="$VER" \
-t llenergymeasure:transformers-${VER} .

For the other engines, replace transformers with vllm or tensorrt in the commands above (and add --gpus all for those two). Then run llem run with a Docker runner configured for the engine; see docker-setup.md.

Run llem config to see the current status of each engine.


Docker pre-flight failed

Symptom: llem run study.yaml exits early with a pre-flight error about Docker.

Cause: One of the Docker pre-flight checks failed. Pre-flight checks verify:

  1. Docker CLI is on PATH.
  2. NVIDIA Container Toolkit binary is on PATH (nvidia-container-runtime, nvidia-ctk, or nvidia-container-cli).
  3. Host nvidia-smi is present (warn only; a remote Docker daemon is supported).
  4. GPU is visible inside a container (docker run --gpus all nvidia-smi).
  5. CUDA/driver compatibility (checked from container probe output).

Fix: Read the error message — it identifies which check failed.

  • Docker not found: install Docker Engine (docs.docker.com/engine/install/).
  • NVIDIA Container Toolkit not found: follow docker-setup.md.
  • GPU not visible inside container: check that --gpus all works with docker run --gpus all nvidia/cuda:12.0-base-ubuntu22.04 nvidia-smi.
  • CUDA/driver mismatch: update the host NVIDIA driver to be compatible with the container's CUDA version.

Skip pre-flight checks for testing or remote daemon setups:

llem run study.yaml --skip-preflight

Out of memory (OOM)

Symptom: The run crashes with an OOM error (CUDA out of memory).

Cause: The model is too large for the available GPU VRAM at the current configuration.

Fix (try in order; a config sketch combining several of these options follows the list):

  1. Use a smaller model.
  2. Reduce transformers.batch_size if you have raised it (the default of 1 is already minimal for Transformers).
  3. Switch to a lower-precision dtype: dtype: float16 or dtype: bfloat16.
  4. Enable BitsAndBytes quantization: pytorch: { load_in_4bit: true }.
  5. For vLLM: reduce vllm.engine.gpu_memory_utilization (e.g. 0.7 instead of 0.9).
  6. For vLLM: reduce vllm.engine.max_model_len to cap KV cache allocation.
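
Two alternative sketches of how these options look in an experiment YAML, assuming the section names used elsewhere on this page (values are illustrative; use only the section that matches your engine: field):

# Transformers engine: lower precision plus BitsAndBytes 4-bit quantization
dtype: float16
pytorch:
  load_in_4bit: true

# vLLM engine: cap GPU memory utilisation and the KV cache via max_model_len
dtype: float16
vllm:
  engine:
    gpu_memory_utilization: 0.7
    max_model_len: 4096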

Permission denied (Docker)

Symptom: docker run fails with permission denied.

Cause: The current user is not in the docker group.

Fix:

sudo usermod -aG docker $USER
newgrp docker # activate without logout

Alternatively, prefix Docker commands with sudo. The group change only becomes permanent after you log out and back in.
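
To confirm the fix took effect, a standard Docker smoke test (not specific to llenergymeasure):

groups                          # should now include 'docker'
docker run --rm hello-world     # daemon reachable without sudo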


Study failed partially

Symptom: A study run produces some results but not all. Some experiments are missing from the output directory.

Cause: Individual experiments may fail while the study continues. llenergymeasure uses skip-and-continue: a failed experiment is recorded as an error in the study manifest, and execution continues with the remaining experiments.

Fix:

  1. Check the study manifest in results/ for per-experiment status; it records the failure reason for each skipped config (see the sketch below).
  2. Fix the failing config using that error message and re-run the specific experiment separately before merging results.
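
A quick way to inspect the manifest from the shell, assuming it is a JSON file somewhere under results/ (the exact filename is not specified here, so locate it first):

find results/ -name '*manifest*'                   # locate the study manifest
python -m json.tool path/to/manifest.json | less   # pretty-print per-experiment status and errors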

Container crashes with "context canceled" or ValidationError

Symptom: Experiments using Docker runners fail immediately. The container log shows context canceled or a Pydantic ValidationError mentioning unknown fields (e.g. dtype: Extra inputs are not permitted or dataset: Input should be a valid string).

Cause: The Transformers Docker image was built from an older version of the source code. The host sends config JSON using the current schema, but the container rejects it because its bundled code expects the old schema. This only applies to Transformers; the vLLM and TensorRT-LLM containers bind-mount host source at run time, so they always see the current schema.

Fix: Rebuild the Transformers image from the current source:

docker build -f docker/Dockerfile.transformers -t ghcr.io/henrycgbaker/llenergymeasure/transformers:v0.9.0 .

Replace v0.9.0 with your installed version (llem --version). See Installation - Getting Engine Images for full instructions.
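
To check that the rebuild actually produced a new image rather than reusing a stale cached one, compare the image ID or creation timestamp before and after (substitute your tag):

docker image inspect ghcr.io/henrycgbaker/llenergymeasure/transformers:v0.9.0 \
  --format '{{.Id}} {{.Created}}'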


Results look wrong / energy is 0

Symptom: inference_energy_joules is 0.0, or energy values seem too low.

Cause: Energy measurement requires NVML (pynvml) and a supported NVIDIA GPU. If NVML is unavailable, energy measurement falls back gracefully to zero rather than crashing.

Fix:

  1. Check llem config — it shows the active energy sampler under Energy:.
  2. Verify pynvml can access the GPU: run python -c "import pynvml; pynvml.nvmlInit(); print('OK')".
  3. Check your config. Setting energy_sampler: null explicitly disables energy measurement (throughput-only mode).
  4. If baseline.enabled: true (default), ensure the baseline measurement is completing. A failed baseline causes adjusted_j to be null.
  5. For very short inference runs (< 200 ms), NVML polling at 100 ms intervals may not collect enough samples for accurate integration. Use a larger n (more inferences per measurement) so more samples are collected.

Zeus and CodeCarbon are optional extras. If they are not installed, the tool falls back to NVML. See energy-measurement.md for backend details.
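
To confirm the GPU itself exposes energy readings through NVML, the pynvml one-liner above can be extended to take two readings a few seconds apart (nvmlDeviceGetTotalEnergyConsumption reports millijoules since driver load and requires a Volta-or-newer GPU; a non-zero difference means the counter is live):

python -c "import pynvml, time; pynvml.nvmlInit(); h = pynvml.nvmlDeviceGetHandleByIndex(0); a = pynvml.nvmlDeviceGetTotalEnergyConsumption(h); time.sleep(2); b = pynvml.nvmlDeviceGetTotalEnergyConsumption(h); print((b - a) / 1000.0, 'J over 2 s')"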


Warmup takes too long

Symptom: Experiments take much longer than expected. Progress stalls for 1-2 minutes before measurement begins.

Cause: Warmup is enabled by default. It runs n_warmup=5 warmup inferences, then waits thermal_floor_seconds=60.0 seconds for GPU temperature to stabilise before measuring.

Fix for quick testing: disable warmup in the experiment YAML you pass to llem run:

warmup:
  enabled: false

Then run as usual:

llem run experiment.yaml
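
A middle ground, if warmup matters but the stall is too long for development runs, is to keep warmup enabled and shrink the thermal floor instead (field names are taken from the defaults above; the exact nesting is an assumption, so check your schema):

warmup:
  enabled: true
  n_warmup: 2                  # fewer warmup inferences than the default 5
  thermal_floor_seconds: 10.0  # much shorter wait than the default 60.0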

For publication-quality measurements, leave warmup enabled. See methodology.md for why warmup matters.


Invalid Parameter Combinations

Config Validation Errors

These combinations are rejected at config load time with a clear error message.

| Engine | Invalid Combination | Reason | Resolution |
|---|---|---|---|
| pytorch | load_in_4bit=True + load_in_8bit=True | Cannot use both 4-bit and 8-bit quantization simultaneously | Choose one: transformers.load_in_4bit: true OR transformers.load_in_8bit: true |
| pytorch | torch_compile_mode without torch_compile=True | torch_compile_mode / torch_compile_backend only take effect when torch_compile=True | Set transformers.torch_compile: true when using torch_compile_mode or torch_compile_backend |
| pytorch | bnb_4bit_* without load_in_4bit=True | BitsAndBytes 4-bit options require 4-bit quantization to be enabled | Set transformers.load_in_4bit: true when using bnb_4bit_compute_dtype, bnb_4bit_quant_type, or bnb_4bit_use_double_quant |
| pytorch | cache_implementation with use_cache=False | Cannot specify a cache strategy when caching is explicitly disabled | Remove use_cache: false or remove cache_implementation |
| all | engine section mismatch | Engine section must match the engine: field | Ensure the pytorch: / vllm: / tensorrt: section matches the engine: field |
| all | passthrough_kwargs key collision | passthrough_kwargs keys must not collide with ExperimentConfig fields | Use named fields directly instead of passthrough_kwargs |
| tensorrt | dtype: float32 | TensorRT-LLM is optimised for lower-precision inference | Use dtype: float16 or dtype: bfloat16 |
| vllm | transformers.load_in_4bit or transformers.load_in_8bit | vLLM does not support bitsandbytes quantization | Use vllm.engine.quantization (awq, gptq, fp8) for quantized inference |
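
For example, a config containing the following sketch (field names as in the table; nesting assumed) is rejected at load time because both BitsAndBytes modes are enabled:

transformers:
  load_in_4bit: true
  load_in_8bit: true   # invalid together with load_in_4bit; keep only one of the two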

Runtime Limitations

These combinations pass config validation but may fail at runtime due to hardware, model, or package requirements.

| Engine | Parameter | Limitation | Resolution |
|---|---|---|---|
| pytorch | transformers.attn_implementation: flash_attention_2 | flash-attn requires an Ampere+ GPU; may fail on older architectures | Use attn_implementation: sdpa on pre-Ampere GPUs |
| pytorch | transformers.attn_implementation: flash_attention_3 | FA3 requires the flash_attn_3 package (built from the flash-attn hopper/ directory) and an Ampere+ GPU (SM80+); included in the Docker image by default | Install locally from source if not using Docker; see Installation - FA3 |
| vllm | vllm.engine.kv_cache_dtype: fp8 | FP8 KV cache requires a Hopper (H100) or newer GPU | Use kv_cache_dtype: auto for automatic selection |
| vllm | vllm.engine.attention.backend: FLASHINFER | FlashInfer requires JIT compilation on first use | Use attention.backend: auto or FLASH_ATTN |
| vllm | vllm.engine.attention.backend: TORCH_SDPA | TORCH_SDPA is not registered in vLLM's attention backends | Use attention.backend: auto or FLASH_ATTN |
| vllm | vllm.engine.quantization: awq or gptq | Requires a pre-quantized model checkpoint | Use a quantized model (e.g. TheBloke/*-AWQ) or omit the setting |
| tensorrt | tensorrt.quant_config.quant_algo: FP8 | FP8 requires SM >= 8.9 (Ada Lovelace or Hopper); A100 (SM80) raises a ConfigurationError with no silent emulation or fallback | Use INT8, W4A16_AWQ, W4A16_GPTQ, or W8A16 on A100 |
| tensorrt | tensorrt.quantization: int8_sq | INT8 SmoothQuant requires a calibration dataset | Provide calibration config or use a supported quantization method |
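
Two sketches of the safe fallbacks from the table, again assuming the nesting used elsewhere on this page:

# Pre-Ampere GPU: avoid flash-attn entirely
transformers:
  attn_implementation: sdpa

# vLLM with AWQ: only valid when the model checkpoint itself is pre-quantized
vllm:
  engine:
    quantization: awq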

Engine Capability Matrix

| Feature | Transformers | vLLM | TensorRT |
|---|---|---|---|
| Tensor Parallel | Yes (HF native) | Yes | Yes |
| Data Parallel | Yes | No | No |
| BitsAndBytes (4-bit) | Yes | No | No |
| BitsAndBytes (8-bit) | Yes | No | No |
| Native Quantization | No | AWQ / GPTQ / FP8 | INT8 / W4A16 (AWQ/GPTQ) / FP8 |
| float32 precision | Yes | No | No |
| float16 precision | Yes | Yes | Yes |
| bfloat16 precision | Yes | Yes | Yes |
| Prefix Caching | No | Yes | No |
| LoRA Adapters | Yes | No | No |
| torch.compile | Yes | No | No |
| Beam Search | Yes | Yes | No |
| Speculative Decoding | Yes | Yes | No |
| Static KV Cache | Yes | No | No |

Notes:

  • Transformers Tensor Parallel uses HF native TP via tp_plan/tp_size (requires Transformers >= 4.50 and torchrun launch).
  • vLLM does not support FP32 precision. Use FP16 or BF16.
  • vLLM supports 4-bit via AWQ/GPTQ quantized models, not bitsandbytes.
  • TensorRT-LLM is optimised for FP16/BF16/INT8 precision, not FP32.

Docker rebuild is slow / recompiling flash-attn

Symptom: make docker-build-transformers takes 15-20 minutes and the post-build summary line reports ⚠ no GHCR cache imported (cold build), or the BuildKit output shows flash-attn source downloads and nvcc compilation on every build.

Cause: BuildKit's cache_from registry pull was skipped. In rough order of likelihood:

(a) BUILDX_BUILDER is unset or pointing at the default docker driver. The default driver cannot import registry caches at all; cache_from entries are silently ignored. Confirm with docker buildx ls: the row marked with * (the current builder) must show driver docker-container, not docker. Fix by adding BUILDX_BUILDER=llem-builder to your .env (it ships in .env.example) and re-running make docker-builder-setup if the builder does not exist yet.

(b) You are on a fresh buildx builder with no local cache. This is normal on the very first build; the first-pull cost is paid once.

(c) You are offline or GHCR is unreachable.

(d) Your TRANSFORMERS_VERSION (from engine_versions/transformers.yaml) does not match any published cache tag. cache_from resolves to :transformers-<VERSION> and falls through to :latest; if neither has usable layers, BuildKit silently cold-builds.

The full BuildKit log for the most recent attempt is at /tmp/llem-build-{engine}.log — grep it for importing cache manifest to see whether the registry was even reached.

Fix:

  1. Confirm the builder driver: docker buildx ls. The active builder (marked *) must be docker-container. If it's docker, run make docker-builder-setup and ensure BUILDX_BUILDER=llem-builder is in your .env (or exported in the shell).
  2. Inspect the builder cache: docker buildx du --builder llem-builder. If it's near-empty, BuildKit has nothing to reuse locally and will pull from the registry.
  3. Verify network: curl -I https://ghcr.io/v2/henrycgbaker/llenergymeasure/transformers/manifests/latest should return 200 or 401 (both fine; 000/timeout means no connectivity).
  4. If you recently bumped the SSOT version but CI hasn't published the per-version tag yet, fall back to :latest (which the cache_from chain already lists as a fallback). No env-var override needed.
  5. If the cache is corrupt, recreate the builder: make docker-builder-rm && make docker-builder-setup. Note this discards all local layer cache; the first subsequent build will repopulate from GHCR.
  6. Offline is expected-slow. BuildKit degrades gracefully to a cold build — no errors, just minutes.

CI can't build the Transformers image (FA3 compile OOM / heartbeat loss): The FA3 Hopper compile requires ~8-16 GB RAM and multiple hours on a 4-core runner. Seed the GHCR cache once from a developer machine with more resources:

docker login ghcr.io # needs write:packages scope
make docker-seed-transformers # builds + pushes cache to ghcr.io (~minutes if locally cached)

After seeding, CI warm-rebuilds from the GHCR cache in <5 min.


Schema skew between host and Docker image

Symptom: llem run study.yaml aborts before any experiment with a message like:

Docker image 'llenergymeasure:transformers' was built from llenergymeasure 0.9.0
(schema 9988776655ff) but the host is running 0.9.0 (schema a1b2c3d4e5f6).
The container will reject ExperimentConfig fields added on the host after
the image was built.

Alternatively, the container log shows a stack trace full of extra_forbidden Pydantic errors (often with URLs mixing errors.pydantic.dev/2.10/… and errors.pydantic.dev/2.12/…, a telltale sign of version skew).

Legacy: the schema-fingerprint handshake no longer catches this. Earlier versions of llem stamped each image with an llem.expconf.schema.fingerprint label and StudyRunner._prepare_images compared it to the host fingerprint before any experiment ran. Once images stopped baking the project source (it is bind-mounted at runtime), the handshake became structurally redundant: the in-container source always equals the host source. The label is no longer set on any engine image; llem doctor reports UNVERIFIED and does not block. The dead version_handshake.py plumbing is tracked for removal in a follow-up issue.

Cause: All three engines now bind-mount the host project source at runtime, so a Pydantic-shape error here means the engine library inside the image is at a version that no longer matches what the host code expects (e.g. the SSOT bumped transformers but the local image is still on an older tag).

Fix: rebuild or repull the affected engine image. One of:

make docker-build-transformers # local build, Transformers
docker pull vllm/vllm-openai:0.7.3 # repull vLLM upstream
docker pull nvcr.io/nvidia/tensorrt-llm/release:0.21.0 # repull TensorRT-LLM upstream
make docker-pull # pull the newest published Transformers tag

Verify the image actually changed by inspecting the digest:

docker image inspect llenergymeasure:transformers --format '{{.Id}}'

Getting Help

Run llem config --verbose to capture full environment details (Python version, installed engines, GPU info, energy sampler status, config file path). Include this output when filing a bug report.

File issues at: github.com/henrycgbaker/llenergymeasure/issues