Engine Configuration
llenergymeasure supports multiple inference engines. Each engine uses a different runtime
and requires different setup. Currently active: Transformers (local), vLLM (Docker), and
TensorRT-LLM (Docker). Planned: SGLang (M5).
Engine Overview
| Engine | Runner | GPU Required | Status |
|---|---|---|---|
| Transformers | local | Yes | Active |
| vLLM | docker | Yes | Active |
| TensorRT-LLM | docker | Yes | Active |
| SGLang | docker | Yes | Planned (M5) |
Runner determines where the engine executes: local runs directly on the host,
docker launches an isolated container.
Transformers (Local)
The default engine. Runs the HuggingFace transformers AutoModelForCausalLM stack directly
on the host. No Docker required.
Minimal config:
model: gpt2
engine: transformers
With Transformers-specific options:
model: gpt2
engine: transformers
n: 100
dtype: bfloat16
transformers:
batch_size: 4
attn_implementation: sdpa
torch_compile: false
load_in_4bit: false
Transformers Parameters
All transformers: fields default to null; null means "use the engine's own default".
Unknown fields under transformers: are forwarded to HuggingFace APIs.
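For example, a minimal sketch showing both behaviours (some_future_hf_kwarg is a hypothetical placeholder, not a real HuggingFace argument):
transformers:
  attn_implementation: null       # null = fall back to the engine's own default
  some_future_hf_kwarg: true      # unrecognised field, forwarded to the HF API as-is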
Batching:
| Parameter | Type | Default | Description |
|---|---|---|---|
batch_size | int | 1 | Number of prompts processed per forward pass |
Attention:
| Parameter | Type | Default | Description |
|---|---|---|---|
attn_implementation | sdpa \| flash_attention_2 \| flash_attention_3 \| eager | sdpa | Attention kernel
Note: flash_attention_3 requires the flash_attn_3 package (built separately from the
flash-attn repo's hopper/ directory) and a Hopper GPU (SM90, e.g. H100).
The Docker Transformers image includes FA3 by default. To skip it (e.g. for faster CI builds),
rebuild with --build-arg INSTALL_FA3=false. See
Installation - FlashAttention-3 for details.
Compilation:
| Parameter | Type | Default | Description |
|---|---|---|---|
torch_compile | bool | false | Enable torch.compile |
torch_compile_mode | str | default | Compile mode: default, reduce-overhead, max-autotune |
torch_compile_backend | str | inductor | Compile backend |
Note: torch_compile_mode and torch_compile_backend require torch_compile: true.
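For example, a compiled-generation sketch (values are illustrative; modes and backends as listed above):
transformers:
  torch_compile: true
  torch_compile_mode: max-autotune     # requires torch_compile: true
  torch_compile_backend: inductor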
Quantization (BitsAndBytes):
| Parameter | Type | Default | Description |
|---|---|---|---|
load_in_4bit | bool | false | BitsAndBytes 4-bit quantization (NF4) |
load_in_8bit | bool | false | BitsAndBytes 8-bit quantization |
bnb_4bit_compute_dtype | float16 \| bfloat16 \| float32 | float32 | Compute dtype for 4-bit (usually set to bfloat16)
bnb_4bit_quant_type | nf4 \| fp4 | nf4 | 4-bit quantization type
bnb_4bit_use_double_quant | bool | false | Double quantization (saves ~0.4 bits/param) |
Note: load_in_4bit and load_in_8bit are mutually exclusive.
bnb_4bit_* fields require load_in_4bit: true.
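A typical 4-bit sketch showing how these fields combine (values chosen for illustration):
transformers:
  load_in_4bit: true                   # mutually exclusive with load_in_8bit
  bnb_4bit_compute_dtype: bfloat16
  bnb_4bit_quant_type: nf4
  bnb_4bit_use_double_quant: true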
KV Cache:
| Parameter | Type | Default | Description |
|---|---|---|---|
use_cache | bool | true | Enable KV cache during generation |
cache_implementation | static \| offloaded_static \| sliding_window | dynamic | KV cache strategy; static enables CUDA graphs
Model Loading:
| Parameter | Type | Default | Description |
|---|---|---|---|
device_map | str | auto | Device placement strategy |
max_memory | dict | null | Per-device memory limits, e.g. {0: "10GiB", cpu: "50GiB"} |
low_cpu_mem_usage | bool | true | Load weights incrementally to minimise peak CPU RAM |
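As a sketch, capping per-device memory while letting the engine place weights automatically (limits are illustrative):
transformers:
  device_map: auto
  max_memory:
    0: "10GiB"        # GPU 0
    cpu: "50GiB"
  low_cpu_mem_usage: true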
Floating-Point Precision:
| Parameter | Type | Default | Description |
|---|---|---|---|
allow_tf32 | bool | null | Allow TF32 for matrix multiplications on Ampere+ (affects energy/throughput) |
autocast_enabled | bool | false | Enable torch.autocast during generation |
autocast_dtype | float16 \| bfloat16 | bfloat16 | AMP dtype (only used when autocast_enabled: true)
Beam Search:
| Parameter | Type | Default | Description |
|---|---|---|---|
num_beams | int | 1 | Beam search width (1 = greedy/sampling) |
early_stopping | bool | false | Stop when all beams hit EOS |
length_penalty | float | 1.0 | Length penalty: >1.0 favours longer outputs, <1.0 favours shorter
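For example, a beam-search sketch (width and penalty values are illustrative):
transformers:
  num_beams: 4
  early_stopping: true
  length_penalty: 1.2       # >1.0 favours longer outputs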
N-gram Repetition:
| Parameter | Type | Default | Description |
|---|---|---|---|
no_repeat_ngram_size | int | 0 | Prevent n-gram repetition (0 = disabled) |
Speculative Decoding:
| Parameter | Type | Default | Description |
|---|---|---|---|
prompt_lookup_num_tokens | int | null | Prompt-lookup speculative decoding (disabled when null) |
Tensor Parallelism (HF Transformers >= 4.50):
| Parameter | Type | Default | Description |
|---|---|---|---|
tp_plan | auto \| null | null | Native HF tensor parallelism plan. Mutually exclusive with device_map. Requires torchrun launch.
tp_size | int | WORLD_SIZE | Number of tensor parallel ranks. Only used when tp_plan is set. |
Note: tp_plan and device_map are mutually exclusive — tensor parallelism handles its own
device placement. When tp_plan='auto', device_map is automatically omitted.
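A tensor-parallel sketch (rank count is illustrative; remember the torchrun launch requirement noted in the table):
transformers:
  tp_plan: auto       # mutually exclusive with device_map
  tp_size: 2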
vLLM (Docker)
A high-throughput inference engine using PagedAttention and continuous batching. Requires Docker with NVIDIA Container Toolkit. See Docker Setup Guide for installation instructions.
Minimal config:
model: gpt2
engine: vllm
runners:
vllm: docker
With vLLM-specific options:
model: gpt2
engine: vllm
n: 100
runners:
vllm: docker
vllm:
engine:
enforce_eager: false
gpu_memory_utilization: 0.9
block_size: 16
kv_cache_dtype: auto
sampling:
max_tokens: 128
presence_penalty: 0.0
Nested structure required. vLLM config uses a nested
engine:/sampling: structure that mirrors vLLM's own API separation. Flat vllm: configs (from pre-M3 versions) are not supported.
vLLM Engine Parameters
vllm.engine: fields map to vllm.LLM() constructor arguments. These are set at model
initialisation time. All fields default to null (use vLLM's own default).
Memory Management:
| Parameter | Type | Default | Description |
|---|---|---|---|
gpu_memory_utilization | float [0.0, 1.0) | 0.9 | GPU memory fraction reserved for KV cache |
swap_space | float | 4 GiB | CPU swap space in GiB for KV cache offloading |
cpu_offload_gb | float | 0 | CPU RAM in GiB to offload model weights to |
KV Cache:
| Parameter | Type | Default | Description |
|---|---|---|---|
block_size | 8 \| 16 \| 32 | 16 | KV cache block size in tokens
kv_cache_dtype | auto \| fp8 \| fp8_e5m2 \| fp8_e4m3 | auto | KV cache storage dtype; fp8 halves VRAM on Ampere+
kv_cache_memory_bytes | int | null | Absolute KV cache size in bytes (mutually exclusive with gpu_memory_utilization) |
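For example, a sketch that measures the effect of an fp8 KV cache (values are illustrative):
vllm:
  engine:
    kv_cache_dtype: fp8       # halves KV cache VRAM on Ampere+
    block_size: 32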
Execution Mode:
| Parameter | Type | Default | Description |
|---|---|---|---|
enforce_eager | bool | false | Disable CUDA graphs, always use eager mode |
enable_chunked_prefill | bool | false | Chunk large prefills across scheduler iterations |
Scheduler / Batching:
| Parameter | Type | Default | Description |
|---|---|---|---|
max_num_seqs | int | 256 | Max concurrent sequences per scheduler iteration |
max_num_batched_tokens | int | auto | Max tokens processed per scheduler iteration |
max_model_len | int | model default | Max sequence length (input + output tokens) |
Parallelism:
| Parameter | Type | Default | Description |
|---|---|---|---|
tensor_parallel_size | int | 1 | Number of GPUs to shard the model across |
pipeline_parallel_size | int | 1 | Pipeline parallel stages |
Quantization and Prefix Caching:
| Parameter | Type | Default | Description |
|---|---|---|---|
quantization | awq \| gptq \| fp8 \| marlin \| bitsandbytes \| ... | null | Quantization method (requires pre-quantised checkpoint)
enable_prefix_caching | bool | false | Automatic prefix caching for shared prompt prefixes |
Scheduler Tuning:
| Parameter | Type | Default | Description |
|---|---|---|---|
num_scheduler_steps | int | 1 | Multi-step scheduling: run N decode steps before returning to the scheduler (increases throughput at the cost of more VRAM per step) |
max_seq_len_to_capture | int | 8192 | Maximum sequence length eligible for CUDA graph capture (sequences longer than this fall back to eager mode) |
Distributed Execution:
| Parameter | Type | Default | Description |
|---|---|---|---|
distributed_executor_backend | mp \| ray | mp | Multi-GPU executor backend (mp = multiprocessing, ray = Ray cluster)
Speculative Decoding:
| Parameter | Type | Default | Description |
|---|---|---|---|
speculative_config | VLLMSpeculativeConfig | null | Speculative decoding sub-config (see below) |
speculative_config sub-config fields:
| Field | Type | Default | Description |
|---|---|---|---|
model | str | null | HF model name or path for the draft model |
num_speculative_tokens | int | null | Tokens to draft per step |
method | str | null | Speculative method (e.g. eagle, medusa; null = draft-model mode) |
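A draft-model speculative decoding sketch (the draft model name is a placeholder, not a recommendation):
vllm:
  engine:
    speculative_config:
      model: facebook/opt-125m          # placeholder draft model
      num_speculative_tokens: 5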
vLLM Sampling Parameters
vllm.sampling: fields map to vLLM-specific SamplingParams extensions.
Universal sampling parameters (temperature, top_p, top_k, repetition_penalty) live in
decoder: and are shared across all engines.
| Parameter | Type | Default | Description |
|---|---|---|---|
max_tokens | int | uses max_output_tokens | Max output tokens (overrides max_output_tokens for vLLM sweeps) |
min_tokens | int | 0 | Minimum output tokens before EOS is allowed |
presence_penalty | float [-2.0, 2.0] | 0.0 | Penalises tokens that appear at all |
frequency_penalty | float [-2.0, 2.0] | 0.0 | Penalises tokens proportional to their frequency |
ignore_eos | bool | false | Continue generating past EOS (forces max_tokens generation) |
n | int | 1 | Number of output sequences per prompt |
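A sketch showing how the universal decoder: block and the vLLM-specific sampling: block combine (values are illustrative):
decoder:
  temperature: 0.7
  top_p: 0.95
vllm:
  sampling:
    max_tokens: 256
    presence_penalty: 0.5
    ignore_eos: false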
vLLM Beam Search
When using beam search with vLLM, use vllm.beam_search: instead of vllm.sampling:.
The two sections are mutually exclusive.
vllm:
engine:
enforce_eager: false
beam_search:
beam_width: 4
length_penalty: 1.0
early_stopping: false
max_tokens: 128
Passthrough for New vLLM Parameters
Unknown fields under vllm.engine: or vllm.sampling: are forwarded to vLLM's native
LLM() and SamplingParams() APIs. This lets you use new vLLM parameters without waiting
for llenergymeasure to add explicit support:
vllm:
engine:
some_new_vllm_param: true # forwarded directly to vllm.LLM()
sampling:
another_new_param: 0.5 # forwarded directly to vllm.SamplingParams()
TensorRT-LLM (Docker)
A maximum-performance inference engine using NVIDIA TensorRT engine compilation. TRT-LLM compiles a model into an optimised TensorRT engine on first use, then runs inference against that engine. Engines are cached on disk so subsequent runs skip compilation. Requires Docker with NVIDIA Container Toolkit. See Docker Setup Guide for installation instructions.
Minimal config:
model: meta-llama/Llama-2-7b-hf
engine: tensorrt
runners:
tensorrt: docker
With TensorRT-LLM-specific options:
model: meta-llama/Llama-2-7b-hf
engine: tensorrt
n: 50
dtype: bfloat16
runners:
tensorrt: docker
tensorrt:
tensor_parallel_size: 2
max_batch_size: 8
max_num_tokens: 4096
dtype: bfloat16
quant:
quant_algo: W4A16_AWQ
Engine compilation on first run. The first run with a given config will compile a TensorRT engine (which may take several minutes). Subsequent runs with the same config use the cached engine and start much faster.
Compile-Time Parameters
These parameters define the engine shape and cannot be changed without recompiling. Changing any [recompile] field invalidates the cached engine and triggers a new build.
| Parameter | Type | Default | Description |
|---|---|---|---|
max_batch_size | int | 8 | Maximum batch size the engine accepts. [recompile] |
tensor_parallel_size | int | 1 | Tensor parallel degree (number of GPUs). [recompile] |
pipeline_parallel_size | int | 1 | Pipeline parallel stages (number of pipeline stages across GPUs). [recompile] |
max_input_len | int | 1024 | Maximum input sequence length in tokens. [recompile] |
max_seq_len | int | 2048 | Maximum total sequence length (input + output). [recompile] |
max_num_tokens | int | auto | Maximum tokens the engine handles per iteration (scheduler throughput axis alongside max_batch_size). [recompile] |
dtype | float16 \| bfloat16 | auto | Model compute dtype. TRT-LLM is optimised for fp16/bf16; fp32 is not supported. [recompile]
fast_build | bool | false | Enable fast engine build mode (reduced optimisation, faster compilation). [recompile] |
backend | trt \| pytorch \| _autodeploy | null (TRT-LLM auto-picks) | TRT-LLM runtime backend, a measurement axis rather than a per-host knob. trt = AOT-compiled TensorRT engine (best steady-state performance, minutes-to-hours compile). pytorch = TRT-LLM's eager runtime (same scheduler/KV cache, no compile, supports newer architectures without hand-written converters). _autodeploy = experimental model porter. null = TRT-LLM auto-picks (respects the TLLM_USE_TRT_ENGINE env var). Distinct from the top-level engine: field.
engine_path | str | null | Path to a pre-compiled engine directory. When set, skips compilation and loads the engine directly. See Pre-Compiled Engine Loading below. |
tensorrt.quant: Quantization
Quantization is applied at engine compile time — changing quant_algo triggers a recompile.
| Parameter | Type | Default | Description |
|---|---|---|---|
quant_algo | see below | null (no quantization) | Quantization algorithm (native QuantAlgo enum name) |
kv_cache_quant_algo | FP8 \| INT8 | null | KV cache quantization algorithm
Valid quant_algo values:
| Value | Description |
|---|---|
FP8 | FP8 weight and activation quantization. Requires SM >= 8.9 (Ada Lovelace or Hopper). Not supported on A100 (SM 8.0). |
INT8 | INT8 smooth quantization |
W4A16_AWQ | 4-bit AWQ weight quantization, FP16 activations |
W4A16_GPTQ | 4-bit GPTQ weight quantization, FP16 activations |
W8A16 | 8-bit weight quantization, FP16 activations |
W8A16_GPTQ | 8-bit GPTQ weight quantization, FP16 activations |
W4A8_AWQ | 4-bit AWQ weight, INT8 activations |
NO_QUANT | Explicitly disable quantization |
A100 note: A100 (SM 8.0) does not support FP8. Valid A100 quantization options:
INT8, W4A16_AWQ, W4A16_GPTQ, W8A16, W8A16_GPTQ, W4A8_AWQ, NO_QUANT.
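For example, an A100-safe sketch (algorithm choices are illustrative):
tensorrt:
  quant:
    quant_algo: W4A16_AWQ
    kv_cache_quant_algo: INT8       # FP8 KV cache is not available on SM 8.0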
tensorrt.kv_cache: KV Cache
| Parameter | Type | Default | Description |
|---|---|---|---|
enable_block_reuse | bool | false | Enable KV cache block reuse across requests |
free_gpu_memory_fraction | float [0.0, 1.0] | 0.9 | Fraction of free GPU memory to allocate for KV cache |
max_tokens | int | auto | Maximum total tokens in the KV cache |
host_cache_size | int | 0 | Host (CPU) cache size in bytes for KV cache offloading (0 = disabled) |
tensorrt.scheduler: Scheduler
| Parameter | Type | Default | Description |
|---|---|---|---|
capacity_scheduling_policy | GUARANTEED_NO_EVICT \| MAX_UTILIZATION \| STATIC_BATCH | GUARANTEED_NO_EVICT | Scheduling capacity policy
Policy descriptions:
- GUARANTEED_NO_EVICT - guarantees no request eviction; may reduce throughput
- MAX_UTILIZATION - maximises GPU utilisation; may evict requests under memory pressure
- STATIC_BATCH - fixed batch size; useful for reproducible benchmarking
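For reproducible benchmarking runs, the policy can be pinned explicitly (a sketch):
tensorrt:
  scheduler:
    capacity_scheduling_policy: STATIC_BATCH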
Pre-Compiled Engine Loading
When tensorrt.engine_path is set, llem loads a pre-compiled TensorRT-LLM engine
directly, skipping engine compilation entirely. This is useful when:
- You have pre-built engines from a previous llm.save() call or a build pipeline
- You want deterministic, fast experiment startup (no compilation overhead)
- You are running sweeps where only runtime parameters vary (not engine shape)
Engine directory structure:
/path/to/engine/
config.json # Must contain 'pretrained_config' and 'build_config' keys
rank0.engine # Compiled engine binary for GPU 0
rank1.engine # Required if tp_size=2 (one file per rank)
tokenizer.json # Saved by llm.save() - enables self-contained engines
tokenizer_config.json
Validation checks (run before loading):
- Directory exists
- config.json exists and is valid JSON
- tp_size in config.json matches tensorrt.tensor_parallel_size (if detectable)
- rank{N}.engine files exist for each rank (0 to tensor_parallel_size - 1)
What happens when engine_path is set:
- model field is still required but used only as a fallback tokeniser source (if the engine directory lacks tokeniser files)
- All compile-time parameters (max_batch_size, tensor_parallel_size, max_input_len, max_seq_len, dtype, fast_build) are ignored - they are baked into the engine
- build_cache is ignored - no compilation occurs, so caching is irrelevant
- Runtime parameters (kv_cache, scheduler, sampling) still apply
- build_metadata.engine_path in the result indicates which engine was loaded
Example:
tensorrt:
engine_path: /engines/llama-7b-fp16-tp1
tensor_parallel_size: 1 # Must match the engine's tensor_parallel_size
Note: Engines are not portable across GPU architectures. An engine compiled on A100 (SM 8.0) will not load on H100 (SM 9.0) or vice versa. TRT-LLM will raise a clear error at load time if there is an architecture mismatch.
Tokeniser note: Engines built via llm.save() include tokeniser files and are self-contained.
Engines built via the trtllm-build CLI may lack tokeniser files - in that case, the model field is used as a fallback tokeniser source.
tensorrt.sampling: TRT-LLM-Specific Sampling
These are TRT-LLM-specific extensions to SamplingParams. Universal sampling parameters
(temperature, top_p, top_k, repetition_penalty) live in decoder: and are shared across
all engines.
| Parameter | Type | Default | Description |
|---|---|---|---|
min_tokens | int | 0 | Minimum output tokens before EOS is allowed |
n | int | 1 | Number of output sequences per prompt |
ignore_eos | bool | false | Continue generating past EOS token (forces full max_output_tokens generation) |
For advanced TRT-LLM parameters, see the TensorRT-LLM documentation.
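For example, a sketch that forces every request to generate its full output budget, which helps with like-for-like energy comparisons:
tensorrt:
  sampling:
    ignore_eos: true        # generate the full max_output_tokens budget
    n: 1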
Switching Between Engines
Change the engine: field and add the required runner config. The model and measurement
parameters stay the same.
# Same experiment — Transformers (local)
model: gpt2
engine: transformers
n: 100
dtype: bfloat16
# Same experiment — vLLM (Docker)
model: gpt2
engine: vllm
n: 100
dtype: bfloat16
runners:
vllm: docker
# Same experiment — TensorRT-LLM (Docker)
model: gpt2
engine: tensorrt
n: 100
dtype: bfloat16
runners:
tensorrt: docker
Changing engine: switches the inference engine. Engine-specific sections (transformers:,
vllm:, tensorrt:) are ignored when not running that engine. Universal parameters (n,
dtype, decoder:, etc.) apply to all engines.
Runner Configuration
The runner determines where each engine executes. Runners are resolved in the following precedence order (highest first):
1. Environment variable — overrides all other settings:
LLEM_RUNNER_VLLM=docker llem run study.yaml
LLEM_RUNNER_VLLM=docker:my-registry/llem-vllm:custom llem run study.yaml
2. Per-study YAML — applies to that study only:
runners:
  vllm: docker                             # use built-in default image
  # vllm: docker:my-registry/llem:custom   # alternative: explicit image override
3. User config file — applies to all runs for that user:
# ~/.config/llenergymeasure/config.yaml
runners:
transformers: local
vllm: docker # always use Docker for vLLM
4. Default — local for all engines if no runner is configured.
Auto-elevation. A study that mixes engines (e.g., transformers local + vllm local) is
automatically elevated to use Docker for vLLM when Docker is available. This is logged at
info level and requires no user action.
Multi-engine without Docker is an error. If a study requires Docker runners and Docker
is not set up, llem fails at pre-flight with a clear error before running any experiments.
Parameter Support Matrix
This matrix shows which parameters are supported by each engine. Derived from the Pydantic
config models. Full runtime verification (with GPU test results) can be generated with
uv run python scripts/generate_param_matrix.py after running GPU tests.
Universal Parameters (all engines)
These parameters live in ExperimentConfig and are shared across all engines:
| Parameter | Transformers | vLLM | TensorRT-LLM | Notes |
|---|---|---|---|---|
model | Yes | Yes | Yes | HuggingFace model ID or local path |
engine | Yes | Yes | Yes | Selects the inference engine |
dataset.n_prompts | Yes | Yes | Yes | Number of prompts |
dtype | Yes | Yes | Yes | fp32, fp16, bf16 |
dataset.source | Yes | Yes | Yes | Dataset source (built-in alias or .jsonl path) |
max_input_tokens | Yes | Yes | Yes | Input sequence length cap |
max_output_tokens | Yes | Yes | Yes | Output token budget |
random_seed | Yes | Yes | Yes | Per-experiment seed: inference RNG, dataset ordering |
decoder.temperature | Yes | Yes | Yes | Sampling temperature |
decoder.top_p | Yes | Yes | Yes | Nucleus sampling threshold |
decoder.top_k | Yes | Yes | Yes | Top-k sampling (0 = disabled) |
decoder.repetition_penalty | Yes | Yes | Yes | Repetition penalty |
decoder.preset | Yes | Yes | Yes | deterministic, creative, balanced |
Transformers-Specific Parameters
| Parameter | Transformers | vLLM | TensorRT-LLM | Notes |
|---|---|---|---|---|
transformers.batch_size | Yes | N/A | N/A | Transformers batching |
transformers.attn_implementation | Yes | N/A | N/A | Attention kernel selection |
transformers.torch_compile | Yes | N/A | N/A | torch.compile acceleration |
transformers.load_in_4bit | Yes | N/A | N/A | BitsAndBytes 4-bit quantization |
transformers.load_in_8bit | Yes | N/A | N/A | BitsAndBytes 8-bit quantization |
transformers.device_map | Yes | N/A | N/A | Device placement strategy |
transformers.num_beams | Yes | N/A | N/A | Beam search width |
transformers.no_repeat_ngram_size | Yes | N/A | N/A | Prevent n-gram repetition |
transformers.prompt_lookup_num_tokens | Yes | N/A | N/A | Prompt-lookup speculative decoding |
transformers.tp_plan | Yes | N/A | N/A | Native HF tensor parallelism plan |
transformers.tp_size | Yes | N/A | N/A | Tensor parallel rank count |
vLLM-Specific Parameters
| Parameter | Transformers | vLLM | TensorRT-LLM | Notes |
|---|---|---|---|---|
vllm.engine.gpu_memory_utilization | N/A | Yes | N/A | KV cache memory fraction |
vllm.engine.block_size | N/A | Yes | N/A | KV cache block size |
vllm.engine.kv_cache_dtype | N/A | Yes | N/A | fp8 KV cache on Ampere+ |
vllm.engine.enforce_eager | N/A | Yes | N/A | Disable CUDA graphs |
vllm.engine.tensor_parallel_size | N/A | Yes | N/A | Multi-GPU sharding |
vllm.engine.quantization | N/A | Yes | N/A | AWQ, GPTQ, FP8, etc. |
vllm.engine.speculative_config | N/A | Yes | N/A | Speculative decoding sub-config (draft model etc.)
vllm.sampling.max_tokens | N/A | Yes | N/A | vLLM-specific max output tokens |
vllm.sampling.presence_penalty | N/A | Yes | N/A | Presence penalty |
vllm.sampling.frequency_penalty | N/A | Yes | N/A | Frequency penalty |
vllm.beam_search.beam_width | N/A | Yes | N/A | vLLM beam search |
TensorRT-LLM-Specific Parameters
| Parameter | Transformers | vLLM | TensorRT-LLM | Notes |
|---|---|---|---|---|
tensorrt.max_batch_size | N/A | N/A | Yes | Compile-time constant |
tensorrt.tensor_parallel_size | N/A | N/A | Yes | Tensor parallel size (compile-time) |
tensorrt.max_input_len | N/A | N/A | Yes | Max input tokens (compile-time) |
tensorrt.max_seq_len | N/A | N/A | Yes | Max total sequence length (compile-time) |
tensorrt.dtype | N/A | N/A | Yes | Model compute dtype (compile-time) |
tensorrt.fast_build | N/A | N/A | Yes | Fast build mode (compile-time) |
tensorrt.engine_path | N/A | N/A | Yes | Pre-compiled engine path |
tensorrt.quant.quant_algo | N/A | N/A | Yes | FP8, INT8, W4A16_AWQ, W4A16_GPTQ, W8A16, etc.
tensorrt.quant.kv_cache_quant_algo | N/A | N/A | Yes | KV cache quantization: FP8 or INT8
tensorrt.kv_cache.free_gpu_memory_fraction | N/A | N/A | Yes | KV cache memory fraction
tensorrt.kv_cache.enable_block_reuse | N/A | N/A | Yes | KV cache block reuse
tensorrt.scheduler.capacity_scheduling_policy | N/A | N/A | Yes | GUARANTEED_NO_EVICT / MAX_UTILIZATION / STATIC_BATCH
tensorrt.build_cache.max_cache_storage_gb | N/A | N/A | Yes | Engine cache size limit |
tensorrt.build_cache.cache_root | N/A | N/A | Yes | Engine cache directory |
tensorrt.sampling.min_tokens | N/A | N/A | Yes | Minimum output tokens |
tensorrt.sampling.ignore_eos | N/A | N/A | Yes | Force full generation past EOS |
tensorrt.sampling.return_perf_metrics | N/A | N/A | Yes | TRT-LLM internal perf metrics |