Invalid Parameter Combinations
Auto-generated from config validators and test results. Last updated: 2026-04-20 23:10 UTC
This document lists parameter combinations that fail either at config load time or at runtime. Validation errors are caught when the config is loaded; runtime limitations depend on the hardware, the model, or the installed packages.
Config Validation Errors
These combinations are rejected at config load time with a clear error message.
| Engine | Invalid Combination | Reason | Resolution |
|---|---|---|---|
| transformers | load_in_4bit=True + load_in_8bit=True | Cannot use both 4-bit and 8-bit quantization simultaneously | Choose one: transformers.load_in_4bit=true OR transformers.load_in_8bit=true |
| transformers | torch_compile_mode without torch_compile=True | torch_compile_mode/torch_compile_backend only take effect when torch_compile=True | Set transformers.torch_compile=true when using torch_compile_mode or torch_compile_backend |
| transformers | bnb_4bit_* without load_in_4bit=True | BitsAndBytes 4-bit options require 4-bit quantization to be enabled | Set transformers.load_in_4bit=true when using bnb_4bit_compute_dtype, bnb_4bit_quant_type, or bnb_4bit_use_double_quant |
| transformers | cache_implementation with use_cache=False | Cannot specify a cache strategy when caching is explicitly disabled | Remove use_cache=false or remove cache_implementation |
| all | engine section mismatch | Engine section must match the engine field | Ensure transformers:/vllm:/tensorrt: section matches engine: field |
| all | passthrough_kwargs key collision | passthrough_kwargs keys must not collide with ExperimentConfig fields | Use named fields directly instead of passthrough_kwargs |
| tensorrt | dtype=float32 | TensorRT-LLM is optimised for FP16/BF16/INT8; float32 engines are not supported | Use dtype='float16' or 'bfloat16' |
| vllm | load_in_4bit or load_in_8bit | vLLM does not support bitsandbytes quantization | Use vllm.quantization (awq, gptq, fp8) for quantized inference |
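For example, enabling both quantization widths fails at load time. The sketch below is illustrative: the field names come from the table above, and the YAML layout follows the examples at the end of this document.

```yaml
# INVALID: both quantization widths enabled; rejected at config load time
engine: transformers
transformers:
  load_in_4bit: true
  load_in_8bit: true
---
# VALID: one width chosen; bnb_4bit_* options require load_in_4bit=true
engine: transformers
transformers:
  load_in_4bit: true
  bnb_4bit_quant_type: nf4
```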
Streaming Mode Constraints
When streaming=True, certain parameters are ignored or behave differently,
because streaming requires sequential per-request processing to measure time to first token (TTFT) and inter-token latency (ITL).
| Engine | Parameter | Behaviour with streaming=True | Impact |
|---|---|---|---|
| all | transformers.batch_size / vllm.max_num_seqs | Ignored - processes 1 request at a time | Reduced throughput but accurate latency |
| transformers | transformers.torch_compile | May cause graph-tracing errors | Falls back to non-compiled inference |
| transformers | transformers.batching_strategy | Ignored - always sequential | No batching optimisation |
| vllm | vllm.enable_chunked_prefill | May interfere with TTFT measurement | Consider disabling for accurate TTFT |
When to use streaming=True:
- Measuring user-perceived latency (TTFT, ITL)
- Evaluating real-time chat/assistant workloads
- MLPerf inference latency benchmarks (see the sketch below this list)
When to use streaming=False:
- Throughput benchmarking
- Batch processing workloads
- torch.compile optimisation testing
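Putting these together, a latency-focused run might look like the following minimal sketch (assuming streaming is a top-level config field, as the parameter names above suggest):

```yaml
# Latency measurement: requests are processed one at a time,
# so batch-size settings are ignored (see the table above)
engine: vllm
streaming: true
vllm:
  enable_chunked_prefill: false  # may otherwise interfere with TTFT measurement
```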
Runtime Limitations
These combinations pass config validation but may fail at runtime due to hardware, model, or package requirements.
| Engine | Parameter | Limitation | Resolution |
|---|---|---|---|
| transformers | transformers.attn_implementation=flash_attention_2 | flash-attn requires Ampere+ GPU (SM80+); fails on older architectures | Use attn_implementation='sdpa' on pre-Ampere GPUs |
| transformers | transformers.attn_implementation=flash_attention_3 | FA3 requires the flash_attn_3 package (built from the flash-attn hopper/ directory) and a Hopper GPU (SM90). The Docker PyTorch image includes it pre-built | Install flash_attn_3 from source, or use the Docker runner |
| vllm | vllm.kv_cache_dtype=fp8 | FP8 KV cache requires Hopper (H100) or newer GPU | Use kv_cache_dtype='auto' for automatic selection |
| vllm | vllm.attention.engine=FLASHINFER | FlashInfer requires JIT compilation on first use | Use attention.engine='auto' or 'FLASH_ATTN' |
| vllm | vllm.attention.engine=TORCH_SDPA | TORCH_SDPA not registered in vLLM attention backends | Use attention.engine='auto' or 'FLASH_ATTN' |
| vllm | vllm.quantization=awq/gptq | Requires a pre-quantized model checkpoint | Use a pre-quantized model (e.g., TheBloke/*-AWQ) or omit the setting |
| vllm | vllm.load_format=pt | Model checkpoint must have .bin files (not just safetensors) | Use load_format='auto' or 'safetensors' |
| tensorrt | tensorrt.quant_config.quant_algo=FP8 | FP8 requires SM >= 8.9 (Ada Lovelace or Hopper). A100 (SM80) raises ConfigurationError - no silent emulation or fallback | Use INT8, W4A16_AWQ, W4A16_GPTQ, or W8A16 on A100 |
| tensorrt | tensorrt.quantization.method=int8_sq | INT8 SmoothQuant requires calibration dataset | Provide tensorrt.quantization.calibration config or use a supported quantization method |
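As an example of the pre-quantized-checkpoint requirement, a minimal sketch (the top-level model field and the specific checkpoint ID are illustrative assumptions):

```yaml
# AWQ expects a checkpoint quantized ahead of time; pointing it at an
# FP16 checkpoint fails at model load
engine: vllm
model: TheBloke/Llama-2-7B-AWQ  # illustrative pre-quantized checkpoint
vllm:
  quantization: awq
```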
Engine Capability Matrix
| Feature | Transformers | vLLM | TensorRT |
|---|---|---|---|
| Tensor Parallel | No | Yes | Yes |
| Data Parallel | No | No | No |
| BitsAndBytes (4-bit) | Yes | No | No |
| BitsAndBytes (8-bit) | Yes | No | No |
| Native Quantization | No | AWQ/GPTQ/FP8 | INT8/W4A16_AWQ/W4A16_GPTQ/FP8 |
| float32 precision | Yes | Yes | No |
| float16 precision | Yes | Yes | Yes |
| bfloat16 precision | Yes | Yes | Yes |
| Prefix Caching | No | Yes | No |
| LoRA Adapters | Yes | No | No |
| torch.compile | Yes | No | No |
| Beam Search | Yes | Yes | No |
| Speculative Decoding | Yes | No | No |
| Static KV Cache | Yes | No | No |
Notes:
- vLLM supports 4-bit via AWQ/GPTQ quantized models, not bitsandbytes
- TensorRT-LLM is optimised for FP16/BF16/INT8, not FP32
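As one example of reading the matrix, multi-GPU tensor parallelism is available through vLLM or TensorRT but not transformers. A minimal sketch for vLLM (tensor_parallel_size is not documented above; it is assumed here to pass through to vLLM's standard engine argument of the same name):

```yaml
# Two-way tensor parallelism via vLLM (see the matrix above);
# tensor_parallel_size is an assumed passthrough to vLLM's engine arg
engine: vllm
vllm:
  tensor_parallel_size: 2
  gpu_memory_utilization: 0.9
```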
Recommended Configurations by Use Case
Memory-Constrained (Consumer GPU)
```yaml
engine: transformers
transformers:
  load_in_4bit: true
  bnb_4bit_quant_type: nf4
```
High Throughput (Production)
```yaml
engine: vllm
vllm:
  gpu_memory_utilization: 0.9
  enable_prefix_caching: true
```
Maximum Performance (Hopper / Ada Lovelace)
```yaml
engine: tensorrt
fp_precision: float16
tensorrt:
  quantization:
    method: fp8  # requires SM >= 8.9 (Ada Lovelace or Hopper); see Runtime Limitations
```