
Invalid Parameter Combinations

Auto-generated from config validators and test results. Last updated: 2026-04-20 23:10 UTC

This document lists parameter combinations that fail config validation or break at runtime. Wherever possible, the tool checks these at config load time and reports a clear error message.

Config Validation Errors

These combinations are rejected at config load time with a clear error message.

| Engine | Invalid Combination | Reason | Resolution |
| --- | --- | --- | --- |
| transformers | load_in_4bit=True + load_in_8bit=True | Cannot use both 4-bit and 8-bit quantization simultaneously | Choose one: transformers.load_in_4bit=true OR transformers.load_in_8bit=true |
| transformers | torch_compile_mode without torch_compile=True | torch_compile_mode/torch_compile_backend only take effect when torch_compile=True | Set transformers.torch_compile=true when using torch_compile_mode or torch_compile_backend |
| transformers | bnb_4bit_* without load_in_4bit=True | BitsAndBytes 4-bit options require 4-bit quantization to be enabled | Set transformers.load_in_4bit=true when using bnb_4bit_compute_dtype, bnb_4bit_quant_type, or bnb_4bit_use_double_quant |
| transformers | cache_implementation with use_cache=False | Cannot specify a cache strategy when caching is explicitly disabled | Remove use_cache=false or remove cache_implementation |
| all | engine section mismatch | Engine section must match the engine field | Ensure the transformers:/vllm:/tensorrt: section matches the engine: field |
| all | passthrough_kwargs key collision | passthrough_kwargs keys must not collide with ExperimentConfig fields | Use named fields directly instead of passthrough_kwargs |
| tensorrt | dtype=float32 | TensorRT-LLM is optimised for lower-precision inference | Use dtype='float16' or 'bfloat16' |
| vllm | load_in_4bit or load_in_8bit | vLLM does not support bitsandbytes quantization | Use vllm.quantization (awq, gptq, fp8) for quantized inference |
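
For example, transformers.torch_compile_mode only takes effect together with transformers.torch_compile; a minimal sketch (the mode value is illustrative):

engine: transformers
transformers:
  torch_compile: true                 # required for the option below to take effect
  torch_compile_mode: reduce-overhead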

Streaming Mode Constraints

When streaming=True, certain parameters are ignored or behave differently because streaming requires sequential per-request processing to measure TTFT/ITL.

| Engine | Parameter | Behaviour with streaming=True | Impact |
| --- | --- | --- | --- |
| all | transformers.batch_size / vllm.max_num_seqs | Ignored - processes 1 request at a time | Reduced throughput but accurate latency |
| transformers | transformers.torch_compile | May cause graph-tracing errors | Falls back to non-compiled inference |
| transformers | transformers.batching_strategy | Ignored - always sequential | No batching optimisation |
| vllm | vllm.enable_chunked_prefill | May interfere with TTFT measurement | Consider disabling for accurate TTFT |

When to use streaming=True:

  • Measuring user-perceived latency (TTFT, ITL)
  • Evaluating real-time chat/assistant workloads
  • MLPerf inference latency benchmarks

When to use streaming=False:

  • Throughput benchmarking
  • Batch processing workloads
  • torch.compile optimisation testing
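
A minimal sketch of the two modes, assuming streaming is a top-level config field as written above; the batch-related fields from the constraints table only take effect in the second case:

# latency run: requests are processed sequentially for accurate TTFT/ITL
streaming: true

# throughput run: batching and engine-side scheduling apply again
streaming: false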

Runtime Limitations

These combinations pass config validation but may fail at runtime due to hardware, model, or package requirements.

| Engine | Parameter | Limitation | Resolution |
| --- | --- | --- | --- |
| transformers | transformers.attn_implementation=flash_attention_2 | flash-attn requires an Ampere+ GPU (SM80+); fails on older architectures | Use attn_implementation='sdpa' on pre-Ampere GPUs |
| transformers | transformers.attn_implementation=flash_attention_3 | FA3 requires the flash_attn_3 package (built from the flash-attn hopper/ directory) and a Hopper GPU (SM90). The Docker PyTorch image includes it pre-built | Install flash_attn_3 from source, or use the Docker runner |
| vllm | vllm.kv_cache_dtype=fp8 | FP8 KV cache requires a Hopper (H100) or newer GPU | Use kv_cache_dtype='auto' for automatic selection |
| vllm | vllm.attention.engine=FLASHINFER | FlashInfer requires JIT compilation on first use | Use attention.engine='auto' or 'FLASH_ATTN' |
| vllm | vllm.attention.engine=TORCH_SDPA | TORCH_SDPA is not registered in vLLM's attention backends | Use attention.engine='auto' or 'FLASH_ATTN' |
| vllm | vllm.quantization_method=awq/gptq | Requires a pre-quantized model checkpoint | Use a quantized model (e.g., TheBloke/*-AWQ) or omit |
| vllm | vllm.load_format=pt | Model checkpoint must contain .bin files (not just safetensors) | Use load_format='auto' or 'safetensors' |
| tensorrt | tensorrt.quant_config.quant_algo=FP8 | FP8 requires SM >= 8.9 (Ada Lovelace or Hopper); on A100 (SM80) this raises ConfigurationError - no silent emulation or fallback | Use INT8, W4A16_AWQ, W4A16_GPTQ, or W8A16 on A100 |
| tensorrt | tensorrt.quantization.method=int8_sq | INT8 SmoothQuant requires a calibration dataset | Provide tensorrt.quantization.calibration config or use a supported quantization method |
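
As an illustration, a run on a pre-Ampere GPU would pick the documented fallbacks up front rather than hit these errors at runtime; a minimal sketch using only fields from the table above:

engine: transformers
transformers:
  attn_implementation: sdpa   # flash_attention_2/3 need newer GPU architectures

The equivalent choice for vLLM on anything older than Hopper is kv_cache_dtype='auto' instead of fp8.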

Engine Capability Matrix

| Feature | Transformers | vLLM | TensorRT |
| --- | --- | --- | --- |
| Tensor Parallel | No | Yes | Yes |
| Data Parallel | No | No | No |
| BitsAndBytes (4-bit) | Yes | No | No |
| BitsAndBytes (8-bit) | Yes | No | No |
| Native Quantization | No | AWQ/GPTQ/FP8 | INT8/W4A16_AWQ/W4A16_GPTQ/FP8 |
| float32 precision | Yes | Yes | No |
| float16 precision | Yes | Yes | Yes |
| bfloat16 precision | Yes | Yes | Yes |
| Prefix Caching | No | Yes | No |
| LoRA Adapters | Yes | No | No |
| torch.compile | Yes | No | No |
| Beam Search | Yes | Yes | No |
| Speculative Decoding | Yes | No | No |
| Static KV Cache | Yes | No | No |

Notes:

  • vLLM supports 4-bit via AWQ/GPTQ quantized models, not bitsandbytes
  • TensorRT-LLM is optimised for FP16/BF16/INT8, not FP32
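
As a sketch of the first note, 4-bit inference in vLLM goes through a pre-quantized checkpoint rather than bitsandbytes flags. The top-level model field and the checkpoint name below are illustrative assumptions; the quantization field is the one referenced in the validation table:

engine: vllm
model: TheBloke/Llama-2-7B-AWQ   # assumed field and name: any pre-quantized AWQ checkpoint
vllm:
  quantization: awq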

Memory-Constrained (Consumer GPU)

engine: transformers
transformers:
  load_in_4bit: true
  bnb_4bit_quant_type: nf4

High Throughput (Production)

engine: vllm
vllm:
  gpu_memory_utilization: 0.9
  enable_prefix_caching: true

Maximum Performance (Ampere+)

engine: tensorrt
fp_precision: float16
tensorrt:
  quantization:
    method: fp8 # requires SM >= 8.9 (Ada Lovelace or Hopper)