Engine Configuration

llenergymeasure supports multiple inference engines. Each engine uses a different runtime and requires different setup. Currently active: Transformers (local), vLLM (Docker), and TensorRT-LLM (Docker). Planned: SGLang (M5).


Engine Overview

| Engine | Runner | GPU Required | Status |
| --- | --- | --- | --- |
| Transformers | local | Yes | Active |
| vLLM | docker | Yes | Active |
| TensorRT-LLM | docker | Yes | Active |
| SGLang | docker | Yes | Planned (M5) |

The runner determines where the engine executes: local runs directly on the host; docker launches an isolated container.


Transformers (Local)

The default engine. Runs the HuggingFace transformers AutoModelForCausalLM stack directly on the host. No Docker required.

Minimal config:

model: gpt2
engine: transformers

With Transformers-specific options:

model: gpt2
engine: transformers
n: 100
dtype: bfloat16
transformers:
  batch_size: 4
  attn_implementation: sdpa
  torch_compile: false
  load_in_4bit: false

Transformers Parameters

All transformers: fields default to null; null means "use the engine's own default". Unknown fields under transformers: are forwarded to HuggingFace APIs.
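
For example, a minimal sketch of this passthrough behaviour; the second field below is hypothetical, not a real HuggingFace parameter:

transformers:
  batch_size: 4
  some_future_hf_option: true   # hypothetical field, forwarded unchanged to the HuggingFace API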

Batching:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| batch_size | int | 1 | Number of prompts processed per forward pass |

Attention:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| attn_implementation | sdpa \| flash_attention_2 \| flash_attention_3 \| eager | sdpa | Attention kernel |

Note: flash_attention_3 requires the flash_attn_3 package (built separately from the flash-attn repo's hopper/ directory) and an Ampere+ GPU (SM80+, e.g. A100 or H100). The Docker Transformers image includes FA3 by default. To skip it (e.g. for faster CI builds), rebuild with --build-arg INSTALL_FA3=false. See Installation - FlashAttention-3 for details.

Compilation:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| torch_compile | bool | false | Enable torch.compile |
| torch_compile_mode | str | default | Compile mode: default, reduce-overhead, max-autotune |
| torch_compile_backend | str | inductor | Compile backend |

Note: torch_compile_mode and torch_compile_backend require torch_compile: true.
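
For example, a minimal sketch enabling compilation with a non-default mode (the values are illustrative):

model: gpt2
engine: transformers
transformers:
  torch_compile: true
  torch_compile_mode: reduce-overhead   # only honoured because torch_compile is true
  torch_compile_backend: inductor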

Quantization (BitsAndBytes):

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| load_in_4bit | bool | false | BitsAndBytes 4-bit quantization (NF4) |
| load_in_8bit | bool | false | BitsAndBytes 8-bit quantization |
| bnb_4bit_compute_dtype | float16 \| bfloat16 \| float32 | float32 | Compute dtype for 4-bit (usually set to bfloat16) |
| bnb_4bit_quant_type | nf4 \| fp4 | nf4 | 4-bit quantization type |
| bnb_4bit_use_double_quant | bool | false | Double quantization (saves ~0.4 bits/param) |

Note: load_in_4bit and load_in_8bit are mutually exclusive. bnb_4bit_* fields require load_in_4bit: true.
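
For example, a 4-bit quantization sketch combining the fields above; the model choice is illustrative:

model: meta-llama/Llama-2-7b-hf
engine: transformers
transformers:
  load_in_4bit: true
  bnb_4bit_compute_dtype: bfloat16   # avoids the slower float32 default
  bnb_4bit_quant_type: nf4
  bnb_4bit_use_double_quant: true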

KV Cache:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| use_cache | bool | true | Enable KV cache during generation |
| cache_implementation | static \| offloaded_static \| sliding_window | dynamic | KV cache strategy; static enables CUDA graphs |

Model Loading:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| device_map | str | auto | Device placement strategy |
| max_memory | dict | null | Per-device memory limits, e.g. {0: "10GiB", cpu: "50GiB"} |
| low_cpu_mem_usage | bool | true | Load weights incrementally to minimise peak CPU RAM |
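
For example, a sketch that caps per-device memory during loading (the limits are illustrative):

transformers:
  device_map: auto
  low_cpu_mem_usage: true
  max_memory: {0: "10GiB", cpu: "50GiB"}   # GPU 0 capped at 10 GiB, CPU at 50 GiB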

Floating-Point Precision:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| allow_tf32 | bool | null | Allow TF32 for matrix multiplications on Ampere+ (affects energy/throughput) |
| autocast_enabled | bool | false | Enable torch.autocast during generation |
| autocast_dtype | float16 \| bfloat16 | bfloat16 | AMP dtype (only used when autocast_enabled: true) |
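
For example, a precision sketch enabling TF32 matmuls and bf16 autocast (values are illustrative):

transformers:
  allow_tf32: true
  autocast_enabled: true
  autocast_dtype: bfloat16   # only used because autocast_enabled is true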

Beam Search:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| num_beams | int | 1 | Beam search width (1 = greedy/sampling) |
| early_stopping | bool | false | Stop when all beams hit EOS |
| length_penalty | float | 1.0 | Length penalty: >1.0 favours longer sequences, <1.0 favours shorter |

N-gram Repetition:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| no_repeat_ngram_size | int | 0 | Prevent n-gram repetition (0 = disabled) |

Speculative Decoding:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| prompt_lookup_num_tokens | int | null | Prompt-lookup speculative decoding (disabled when null) |

Tensor Parallelism (HF Transformers >= 4.50):

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| tp_plan | auto \| null | null | Native HF tensor parallelism plan. Mutually exclusive with device_map. Requires torchrun launch. |
| tp_size | int | WORLD_SIZE | Number of tensor parallel ranks. Only used when tp_plan is set. |

Note: tp_plan and device_map are mutually exclusive — tensor parallelism handles its own device placement. When tp_plan='auto', device_map is automatically omitted.
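
For example, a two-GPU tensor-parallel sketch (the model name is illustrative). Per the note above, device_map is omitted automatically, and the run must be launched via torchrun:

model: meta-llama/Llama-2-7b-hf
engine: transformers
transformers:
  tp_plan: auto
  tp_size: 2   # defaults to WORLD_SIZE from the torchrun launch if omitted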


vLLM (Docker)

A high-throughput inference engine using PagedAttention and continuous batching. Requires Docker with NVIDIA Container Toolkit. See Docker Setup Guide for installation instructions.

Minimal config:

model: gpt2
engine: vllm
runners:
  vllm: docker

With vLLM-specific options:

model: gpt2
engine: vllm
n: 100
runners:
  vllm: docker
vllm:
  engine:
    enforce_eager: false
    gpu_memory_utilization: 0.9
    block_size: 16
    kv_cache_dtype: auto
  sampling:
    max_tokens: 128
    presence_penalty: 0.0

Nested structure required. vLLM config uses a nested engine: / sampling: structure that mirrors vLLM's own API separation. Flat vllm: configs (from pre-M3 versions) are not supported.

vLLM Engine Parameters

vllm.engine: fields map to vllm.LLM() constructor arguments. These are set at model initialisation time. All fields default to null (use vLLM's own default).

Memory Management:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| gpu_memory_utilization | float [0.0, 1.0) | 0.9 | GPU memory fraction reserved for KV cache |
| swap_space | float | 4 | CPU swap space in GiB for KV cache offloading |
| cpu_offload_gb | float | 0 | CPU RAM in GiB to offload model weights to |

KV Cache:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| block_size | 8 \| 16 \| 32 | 16 | KV cache block size in tokens |
| kv_cache_dtype | auto \| fp8 \| fp8_e5m2 \| fp8_e4m3 | auto | KV cache storage dtype; fp8 halves VRAM on Ampere+ |
| kv_cache_memory_bytes | int | null | Absolute KV cache size in bytes (mutually exclusive with gpu_memory_utilization) |
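
For example, a sketch storing the KV cache in fp8 to roughly halve its VRAM footprint (values are illustrative):

vllm:
  engine:
    kv_cache_dtype: fp8
    block_size: 16
    gpu_memory_utilization: 0.9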

Execution Mode:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enforce_eager | bool | false | Disable CUDA graphs, always use eager mode |
| enable_chunked_prefill | bool | false | Chunk large prefills across scheduler iterations |

Scheduler / Batching:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_num_seqs | int | 256 | Max concurrent sequences per scheduler iteration |
| max_num_batched_tokens | int | auto | Max tokens processed per scheduler iteration |
| max_model_len | int | model default | Max sequence length (input + output tokens) |

Parallelism:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| tensor_parallel_size | int | 1 | Number of GPUs to shard the model across |
| pipeline_parallel_size | int | 1 | Pipeline parallel stages |

Quantization and Prefix Caching:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| quantization | awq \| gptq \| fp8 \| marlin \| bitsandbytes \| ... | null | Quantization method (requires pre-quantised checkpoint) |
| enable_prefix_caching | bool | false | Automatic prefix caching for shared prompt prefixes |

Scheduler Tuning:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| num_scheduler_steps | int | 1 | Multi-step scheduling: run N decode steps before returning to the scheduler (increases throughput at the cost of more VRAM per step) |
| max_seq_len_to_capture | int | 8192 | Maximum sequence length eligible for CUDA graph capture (sequences longer than this fall back to eager mode) |
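
For example, a throughput-oriented sketch combining the scheduler and batching knobs above; the values are illustrative starting points, not recommendations:

vllm:
  engine:
    max_num_seqs: 128
    max_num_batched_tokens: 8192
    enable_chunked_prefill: true
    num_scheduler_steps: 8   # trades extra VRAM per step for throughput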

Distributed Execution:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| distributed_executor_backend | mp \| ray | mp | Multi-GPU executor backend (mp = multiprocessing, ray = Ray cluster) |

Speculative Decoding:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| speculative_config | VLLMSpeculativeConfig | null | Speculative decoding sub-config (see below) |

speculative_config sub-config fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | null | HF model name or path for the draft model |
| num_speculative_tokens | int | null | Tokens to draft per step |
| method | str | null | Speculative method (e.g. eagle, medusa; null = draft-model mode) |
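
For example, a draft-model speculative decoding sketch; the draft model name is illustrative:

vllm:
  engine:
    speculative_config:
      model: TinyLlama/TinyLlama-1.1B-Chat-v1.0   # illustrative draft model
      num_speculative_tokens: 5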

vLLM Sampling Parameters

vllm.sampling: fields map to vLLM-specific SamplingParams extensions. Universal sampling parameters (temperature, top_p, top_k, repetition_penalty) live in decoder: and are shared across all engines.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_tokens | int | uses max_output_tokens | Max output tokens (overrides max_output_tokens for vLLM sweeps) |
| min_tokens | int | 0 | Minimum output tokens before EOS is allowed |
| presence_penalty | float [-2.0, 2.0] | 0.0 | Penalises tokens that appear at all |
| frequency_penalty | float [-2.0, 2.0] | 0.0 | Penalises tokens proportional to their frequency |
| ignore_eos | bool | false | Continue generating past EOS (forces max_tokens generation) |
| n | int | 1 | Number of output sequences per prompt |

When using beam search with vLLM, use vllm.beam_search: instead of vllm.sampling:. The two sections are mutually exclusive.

vllm:
  engine:
    enforce_eager: false
  beam_search:
    beam_width: 4
    length_penalty: 1.0
    early_stopping: false
    max_tokens: 128

Passthrough for New vLLM Parameters

Unknown fields under vllm.engine: or vllm.sampling: are forwarded to vLLM's native LLM() and SamplingParams() APIs. This lets you use new vLLM parameters without waiting for llenergymeasure to add explicit support:

vllm:
  engine:
    some_new_vllm_param: true    # forwarded directly to vllm.LLM()
  sampling:
    another_new_param: 0.5       # forwarded directly to vllm.SamplingParams()

TensorRT-LLM (Docker)

A maximum-performance inference engine using NVIDIA TensorRT engine compilation. TRT-LLM compiles a model into an optimised TensorRT engine on first use, then runs inference against that engine. Engines are cached on disk so subsequent runs skip compilation. Requires Docker with NVIDIA Container Toolkit. See Docker Setup Guide for installation instructions.

Minimal config:

model: meta-llama/Llama-2-7b-hf
engine: tensorrt
runners:
  tensorrt: docker

With TensorRT-LLM-specific options:

model: meta-llama/Llama-2-7b-hf
engine: tensorrt
n: 50
dtype: bfloat16
runners:
  tensorrt: docker
tensorrt:
  tensor_parallel_size: 2
  max_batch_size: 8
  max_num_tokens: 4096
  dtype: bfloat16
  quant:
    quant_algo: W4A16_AWQ

Engine compilation on first run. The first run with a given config will compile a TensorRT engine (which may take several minutes). Subsequent runs with the same config use the cached engine and start much faster.

Compile-Time Parameters

These parameters define the engine shape and cannot be changed without recompiling. Changing any [recompile] field invalidates the cached engine and triggers a new build.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_batch_size | int | 8 | Maximum batch size the engine accepts. [recompile] |
| tensor_parallel_size | int | 1 | Tensor parallel degree (number of GPUs). [recompile] |
| pipeline_parallel_size | int | 1 | Pipeline parallel stages (number of pipeline stages across GPUs). [recompile] |
| max_input_len | int | 1024 | Maximum input sequence length in tokens. [recompile] |
| max_seq_len | int | 2048 | Maximum total sequence length (input + output). [recompile] |
| max_num_tokens | int | auto | Maximum tokens the engine handles per iteration (scheduler throughput axis alongside max_batch_size). [recompile] |
| dtype | float16 \| bfloat16 | auto | Model compute dtype. TRT-LLM is optimised for fp16/bf16; fp32 is not supported. [recompile] |
| fast_build | bool | false | Enable fast engine build mode (reduced optimisation, faster compilation). [recompile] |
| backend | trt \| pytorch \| _autodeploy | null (TRT-LLM auto-picks) | TRT-LLM runtime backend; see the note below. Distinct from the top-level engine: field. |
| engine_path | str | null | Path to a pre-compiled engine directory. When set, skips compilation and loads the engine directly. See Pre-Compiled Engine Loading below. |

Note on backend: this is a measurement axis, not a per-host knob. trt is an AOT-compiled TensorRT engine (best steady-state performance, minutes-to-hours compile). pytorch is TRT-LLM's eager runtime (same scheduler and KV cache, no compile step, supports newer architectures without hand-written converters). _autodeploy is an experimental model-porter. When null, TRT-LLM auto-picks (respects the TLLM_USE_TRT_ENGINE env).

tensorrt.quant: Quantization

Quantization is applied at engine compile time — changing quant_algo triggers a recompile.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| quant_algo | see below | null (no quantization) | Quantization algorithm (native QuantAlgo enum name) |
| kv_cache_quant_algo | FP8 \| INT8 | null | KV cache quantization algorithm |

Valid quant_algo values:

| Value | Description |
| --- | --- |
| FP8 | FP8 weight and activation quantization. Requires SM >= 8.9 (Ada Lovelace or Hopper). Not supported on A100 (SM 8.0). |
| INT8 | INT8 smooth quantization |
| W4A16_AWQ | 4-bit AWQ weight quantization, FP16 activations |
| W4A16_GPTQ | 4-bit GPTQ weight quantization, FP16 activations |
| W8A16 | 8-bit weight quantization, FP16 activations |
| W8A16_GPTQ | 8-bit GPTQ weight quantization, FP16 activations |
| W4A8_AWQ | 4-bit AWQ weight, INT8 activations |
| NO_QUANT | Explicitly disable quantization |

A100 note: A100 (SM 8.0) does not support FP8. Valid A100 quantization options: INT8, W4A16_AWQ, W4A16_GPTQ, W8A16, W8A16_GPTQ, W4A8_AWQ, NO_QUANT.
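
For example, an A100-safe quantization sketch using one of the options listed above (the model name is illustrative):

model: meta-llama/Llama-2-7b-hf
engine: tensorrt
runners:
  tensorrt: docker
tensorrt:
  quant:
    quant_algo: W4A16_AWQ        # valid on SM 8.0; FP8 is not
    kv_cache_quant_algo: INT8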

tensorrt.kv_cache: KV Cache

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable_block_reuse | bool | false | Enable KV cache block reuse across requests |
| free_gpu_memory_fraction | float [0.0, 1.0] | 0.9 | Fraction of free GPU memory to allocate for KV cache |
| max_tokens | int | auto | Maximum total tokens in the KV cache |
| host_cache_size | int | 0 | Host (CPU) cache size in bytes for KV cache offloading (0 = disabled) |
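
For example, a sketch enabling KV block reuse with a slightly smaller memory fraction (values are illustrative):

tensorrt:
  kv_cache:
    enable_block_reuse: true
    free_gpu_memory_fraction: 0.85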

tensorrt.scheduler: Scheduler

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| capacity_scheduling_policy | GUARANTEED_NO_EVICT \| MAX_UTILIZATION \| STATIC_BATCH | GUARANTEED_NO_EVICT | Scheduling capacity policy |

Policy descriptions:

  • GUARANTEED_NO_EVICT — guarantees no request eviction; may reduce throughput
  • MAX_UTILIZATION — maximises GPU utilisation; may evict requests under memory pressure
  • STATIC_BATCH — fixed batch size; useful for reproducible benchmarking
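
For example, a sketch selecting the fixed-batch policy for reproducible benchmarking:

tensorrt:
  scheduler:
    capacity_scheduling_policy: STATIC_BATCH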

Pre-Compiled Engine Loading

When tensorrt.engine_path is set, llem loads a pre-compiled TensorRT-LLM engine directly, skipping engine compilation entirely. This is useful when:

  • You have pre-built engines from a previous llm.save() call or a build pipeline
  • You want deterministic, fast experiment startup (no compilation overhead)
  • You are running sweeps where only runtime parameters vary (not engine shape)

Engine directory structure:

/path/to/engine/
  config.json             # Must contain 'pretrained_config' and 'build_config' keys
  rank0.engine            # Compiled engine binary for GPU 0
  rank1.engine            # Required if tp_size=2 (one file per rank)
  tokenizer.json          # Saved by llm.save() - enables self-contained engines
  tokenizer_config.json

Validation checks (run before loading):

  1. Directory exists
  2. config.json exists and is valid JSON
  3. tp_size in config.json matches tensorrt.tensor_parallel_size (if detectable)
  4. rank{N}.engine files exist for each rank (0 to tensor_parallel_size-1)

What happens when engine_path is set:

  • model field is still required but used only as a fallback tokeniser source (if the engine directory lacks tokeniser files)
  • All compile-time parameters (max_batch_size, tensor_parallel_size, max_input_len, max_seq_len, dtype, fast_build) are ignored - they are baked into the engine
  • build_cache is ignored - no compilation occurs, so caching is irrelevant
  • Runtime parameters (kv_cache, scheduler, sampling) still apply
  • build_metadata.engine_path in the result indicates which engine was loaded

Example:

tensorrt:
  engine_path: /engines/llama-7b-fp16-tp1
  tensor_parallel_size: 1   # Must match the engine's tensor_parallel_size

Note: Engines are not portable across GPU architectures. An engine compiled on A100 (SM 8.0) will not load on H100 (SM 9.0) or vice versa. TRT-LLM will raise a clear error at load time if there is an architecture mismatch.

Tokeniser note: Engines built via llm.save() include tokeniser files and are self-contained. Engines built via the trtllm-build CLI may lack tokeniser files - in that case, the model field is used as a fallback tokeniser source.

tensorrt.sampling: TRT-LLM-Specific Sampling

These are TRT-LLM-specific extensions to SamplingParams. Universal sampling parameters (temperature, top_p, top_k, repetition_penalty) live in decoder: and are shared across all engines.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| min_tokens | int | 0 | Minimum output tokens before EOS is allowed |
| n | int | 1 | Number of output sequences per prompt |
| ignore_eos | bool | false | Continue generating past EOS token (forces full max_output_tokens generation) |

For advanced TRT-LLM parameters, see the TensorRT-LLM documentation.


Switching Between Engines

Change the engine: field and add the required runner config. The model and measurement parameters stay the same.

# Same experiment — Transformers (local)
model: gpt2
engine: transformers
n: 100
dtype: bfloat16

# Same experiment — vLLM (Docker)
model: gpt2
engine: vllm
n: 100
dtype: bfloat16
runners:
  vllm: docker

# Same experiment — TensorRT-LLM (Docker)
model: gpt2
engine: tensorrt
n: 100
dtype: bfloat16
runners:
  tensorrt: docker

Changing engine: switches the inference engine. Engine-specific sections (transformers:, vllm:, tensorrt:) are ignored when not running that engine. Universal parameters (n, dtype, decoder:, etc.) apply to all engines.

Runner Configuration

The runner determines where each engine executes. Configure runners in three ways (listed in precedence order, highest first):

1. Environment variable — overrides all other settings:

LLEM_RUNNER_VLLM=docker llem run study.yaml
LLEM_RUNNER_VLLM=docker:my-registry/llem-vllm:custom llem run study.yaml

2. Per-study YAML — applies to that study only:

runners:
  vllm: docker                              # use built-in default image
  # vllm: docker:my-registry/llem:custom    # or an explicit image override

3. User config file — applies to all runs for that user:

# ~/.config/llenergymeasure/config.yaml
runners:
  transformers: local
  vllm: docker        # always use Docker for vLLM

4. Default: local for all engines if no runner is configured.

Auto-elevation. A study that mixes engines (e.g., transformers local + vllm local) is automatically elevated to use Docker for vLLM when Docker is available. This is logged at info level and requires no user action.

Multi-engine without Docker is an error. If a study requires Docker runners and Docker is not set up, llem fails at pre-flight with a clear error before running any experiments.


Parameter Support Matrix

This matrix shows which parameters are supported by each engine. Derived from the Pydantic config models. Full runtime verification (with GPU test results) can be generated with uv run python scripts/generate_param_matrix.py after running GPU tests.

Universal Parameters (all engines)

These parameters live in ExperimentConfig and are shared across all engines:

| Parameter | Transformers | vLLM | TensorRT-LLM | Notes |
| --- | --- | --- | --- | --- |
| model | Yes | Yes | Yes | HuggingFace model ID or local path |
| engine | Yes | Yes | Yes | Selects the inference engine |
| dataset.n_prompts | Yes | Yes | Yes | Number of prompts |
| dtype | Yes | Yes | Yes | fp32, fp16, bf16 |
| dataset.source | Yes | Yes | Yes | Dataset source (built-in alias or .jsonl path) |
| max_input_tokens | Yes | Yes | Yes | Input sequence length cap |
| max_output_tokens | Yes | Yes | Yes | Output token budget |
| random_seed | Yes | Yes | Yes | Per-experiment seed: inference RNG, dataset ordering |
| decoder.temperature | Yes | Yes | Yes | Sampling temperature |
| decoder.top_p | Yes | Yes | Yes | Nucleus sampling threshold |
| decoder.top_k | Yes | Yes | Yes | Top-k sampling (0 = disabled) |
| decoder.repetition_penalty | Yes | Yes | Yes | Repetition penalty |
| decoder.preset | Yes | Yes | Yes | deterministic, creative, balanced |

Transformers-Specific Parameters

| Parameter | Transformers | vLLM | TensorRT-LLM | Notes |
| --- | --- | --- | --- | --- |
| transformers.batch_size | Yes | N/A | N/A | Transformers batching |
| transformers.attn_implementation | Yes | N/A | N/A | Attention kernel selection |
| transformers.torch_compile | Yes | N/A | N/A | torch.compile acceleration |
| transformers.load_in_4bit | Yes | N/A | N/A | BitsAndBytes 4-bit quantization |
| transformers.load_in_8bit | Yes | N/A | N/A | BitsAndBytes 8-bit quantization |
| transformers.device_map | Yes | N/A | N/A | Device placement strategy |
| transformers.num_beams | Yes | N/A | N/A | Beam search width |
| transformers.no_repeat_ngram_size | Yes | N/A | N/A | Prevent n-gram repetition |
| transformers.prompt_lookup_num_tokens | Yes | N/A | N/A | Prompt-lookup speculative decoding |
| transformers.tp_plan | Yes | N/A | N/A | Native HF tensor parallelism plan |
| transformers.tp_size | Yes | N/A | N/A | Tensor parallel rank count |

vLLM-Specific Parameters

| Parameter | Transformers | vLLM | TensorRT-LLM | Notes |
| --- | --- | --- | --- | --- |
| vllm.engine.gpu_memory_utilization | N/A | Yes | N/A | KV cache memory fraction |
| vllm.engine.block_size | N/A | Yes | N/A | KV cache block size |
| vllm.engine.kv_cache_dtype | N/A | Yes | N/A | fp8 KV cache on Ampere+ |
| vllm.engine.enforce_eager | N/A | Yes | N/A | Disable CUDA graphs |
| vllm.engine.tensor_parallel_size | N/A | Yes | N/A | Multi-GPU sharding |
| vllm.engine.quantization | N/A | Yes | N/A | AWQ, GPTQ, FP8, etc. |
| vllm.engine.speculative_config | N/A | Yes | N/A | Speculative decoding sub-config (draft model, etc.) |
| vllm.sampling.max_tokens | N/A | Yes | N/A | vLLM-specific max output tokens |
| vllm.sampling.presence_penalty | N/A | Yes | N/A | Presence penalty |
| vllm.sampling.frequency_penalty | N/A | Yes | N/A | Frequency penalty |
| vllm.beam_search.beam_width | N/A | Yes | N/A | vLLM beam search |

TensorRT-LLM-Specific Parameters

| Parameter | Transformers | vLLM | TensorRT-LLM | Notes |
| --- | --- | --- | --- | --- |
| tensorrt.max_batch_size | N/A | N/A | Yes | Compile-time constant |
| tensorrt.tensor_parallel_size | N/A | N/A | Yes | Tensor parallel size (compile-time) |
| tensorrt.max_input_len | N/A | N/A | Yes | Max input tokens (compile-time) |
| tensorrt.max_seq_len | N/A | N/A | Yes | Max total sequence length (compile-time) |
| tensorrt.dtype | N/A | N/A | Yes | Model compute dtype (compile-time) |
| tensorrt.fast_build | N/A | N/A | Yes | Fast build mode (compile-time) |
| tensorrt.engine_path | N/A | N/A | Yes | Pre-compiled engine path |
| tensorrt.quant.quant_algo | N/A | N/A | Yes | FP8, INT8, W4A16_AWQ, W4A16_GPTQ, W8A16, etc. |
| tensorrt.quant.kv_cache_quant_algo | N/A | N/A | Yes | KV cache quantization: FP8 or INT8 |
| tensorrt.kv_cache.free_gpu_memory_fraction | N/A | N/A | Yes | KV cache memory fraction |
| tensorrt.kv_cache.enable_block_reuse | N/A | N/A | Yes | KV cache block reuse |
| tensorrt.scheduler.capacity_scheduling_policy | N/A | N/A | Yes | GUARANTEED_NO_EVICT / MAX_UTILIZATION / STATIC_BATCH |
| tensorrt.build_cache.max_cache_storage_gb | N/A | N/A | Yes | Engine cache size limit |
| tensorrt.build_cache.cache_root | N/A | N/A | Yes | Engine cache directory |
| tensorrt.sampling.min_tokens | N/A | N/A | Yes | Minimum output tokens |
| tensorrt.sampling.ignore_eos | N/A | N/A | Yes | Force full generation past EOS |
| tensorrt.sampling.return_perf_metrics | N/A | N/A | Yes | TRT-LLM internal perf metrics |