
TensorRT-LLM Engine Schema

Engine version: 0.21.0
Discovered at: 2026-05-06T20:20:57+02:00
Discovery method: TrtLlmArgs.model_json_schema() + dataclasses.fields(SamplingParams)
Schema version: 1.0.0

Summary: 60 engine parameters, 47 sampling parameters.

Discovery limitations

  • engine_params — BuildConfig is not a Pydantic model, so it appears as Optional[object] in the schema. Affected field: build_config.
  • sampling_params — SamplingParams is a dataclass, so no per-field descriptions are available.
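The dataclass limitation above can be reproduced with a minimal sketch: `dataclasses.fields()` exposes names, types, and defaults, but carries no per-field description metadata. The stand-in class below is hypothetical (the real `SamplingParams` lives in `tensorrt_llm.sampling_params` and has 47 fields).

```python
from dataclasses import dataclass, fields
from typing import Optional

# Hypothetical stand-in mirroring three SamplingParams fields;
# defaults match the table below.
@dataclass
class SamplingParamsStandIn:
    max_tokens: int = 32
    temperature: Optional[float] = None
    top_k: Optional[int] = None

# fields() yields name/type/default for each field -- but there is no
# description attribute, which is why the sampling table has no
# Description column.
rows = [(f.name, f.default) for f in fields(SamplingParamsStandIn)]
print(rows)
```

The same pattern, pointed at the real class, is what the discovery method listed at the top of this page does for the sampling half of the schema.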

Engine Parameters

| Field | Type | Default | Description |
|---|---|---|---|
| model | string | | The path to the model checkpoint or the model name from the Hugging Face Hub. |
| tokenizer | Optional[string] | None | |
| tokenizer_mode | Literal['auto', 'slow'] | auto | The mode to initialize the tokenizer. |
| skip_tokenizer_init | boolean | false | Whether to skip the tokenizer initialization. |
| trust_remote_code | boolean | false | Whether to trust remote code. |
| tensor_parallel_size | integer | 1 | The tensor parallel size. |
| dtype | string | auto | The data type to use for the model. |
| revision | Optional[string] | None | |
| tokenizer_revision | Optional[string] | None | |
| pipeline_parallel_size | integer | 1 | The pipeline parallel size. |
| context_parallel_size | integer | 1 | The context parallel size. |
| gpus_per_node | Optional[integer] | None | |
| moe_cluster_parallel_size | Optional[integer] | None | |
| moe_tensor_parallel_size | Optional[integer] | None | |
| moe_expert_parallel_size | Optional[integer] | None | |
| enable_attention_dp | boolean | false | Enable attention data parallelism. |
| cp_config | Optional[object] | None | |
| load_format | Literal['auto', 'dummy'] | auto | The format used to load the model. |
| enable_lora | boolean | false | Enable LoRA. |
| ⚠️ max_lora_rank | Optional[integer] | None | |
| ⚠️ max_loras | integer | 4 | The maximum number of LoRAs. |
| ⚠️ max_cpu_loras | integer | 4 | The maximum number of LoRAs on CPU. |
| lora_config | Optional[LoraConfig] | None | |
| enable_prompt_adapter | boolean | false | Enable prompt adapter. |
| max_prompt_adapter_token | integer | 0 | The maximum number of prompt adapter tokens. |
| quant_config | Optional[QuantConfig] | None | |
| kv_cache_config | KvCacheConfig | | KV cache config. |
| enable_chunked_prefill | boolean | false | Enable chunked prefill. |
| guided_decoding_backend | Optional[string] | None | |
| batched_logits_processor | Optional[tensorrt_llm.sampling_params.BatchedLogitsProcessor] | | Batched logits processor. |
| iter_stats_max_iterations | Optional[integer] | None | |
| request_stats_max_iterations | Optional[integer] | None | |
| peft_cache_config | Optional[PeftCacheConfig] | None | |
| scheduler_config | SchedulerConfig | | Scheduler config. |
| cache_transceiver_config | Optional[CacheTransceiverConfig] | None | |
| speculative_config | Union[LookaheadDecodingConfig, MedusaDecodingConfig, EagleDecodingConfig, …] | | |
| batching_type | Optional[BatchingType] | None | |
| normalize_log_probs | boolean | false | Normalize log probabilities. |
| max_batch_size | Optional[integer] | None | |
| max_input_len | Optional[integer] | None | |
| max_seq_len | Optional[integer] | None | |
| max_beam_width | Optional[integer] | None | |
| max_num_tokens | Optional[integer] | None | |
| gather_generation_logits | boolean | false | Gather generation logits. |
| num_postprocess_workers | integer | 0 | The number of processes used for postprocessing the generated tokens, including detokenization. |
| postprocess_tokenizer_dir | Optional[string] | None | |
| reasoning_parser | Optional[string] | None | |
| garbage_collection_gen0_threshold | integer | 20000 | Threshold for Python garbage collection of generation-0 objects. Lower values trigger more frequent garbage collection. |
| ⚠️ decoding_config | Optional[DecodingConfig] | | The decoding config. |
| backend | Optional[string] | None | |
| ⚠️ auto_parallel | boolean | false | Enable auto parallel mode. |
| ⚠️ auto_parallel_world_size | Optional[integer] | None | |
| enable_tqdm | boolean | false | Enable the tqdm progress bar. |
| workspace | Optional[string] | None | |
| enable_build_cache | Union[tensorrt_llm.llmapi.build_cache.BuildCacheConfig, bool] | false | Enable the build cache. |
| extended_runtime_perf_knob_config | Optional[ExtendedRuntimePerfKnobConfig] | None | |
| calib_config | Optional[CalibConfig] | None | |
| embedding_parallel_mode | string | SHARDING_ALONG_VOCAB | The embedding parallel mode. |
| fast_build | boolean | false | Enable fast build. |
| build_config | Optional[tensorrt_llm.builder.BuildConfig] | | Build config. |
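A minimal sketch of how a few of these engine parameters might be passed to the LLM constructor. Only the keyword names come from the schema; the model name and the overridden values below are illustrative placeholders, not recommendations.

```python
# Engine-parameter overrides as a plain kwargs dict. Keys are taken from
# the engine-parameters table; values are example choices.
engine_kwargs = {
    "model": "org/model-name",        # placeholder: checkpoint path or HF Hub name
    "tensor_parallel_size": 2,        # schema default: 1
    "dtype": "auto",                  # schema default: auto
    "trust_remote_code": False,       # schema default: false
    "enable_chunked_prefill": True,   # schema default: false
}

# With TensorRT-LLM installed, these would pass straight through, e.g.:
#   from tensorrt_llm import LLM
#   llm = LLM(**engine_kwargs)
print(sorted(engine_kwargs))
```

Keeping overrides in a dict like this makes it easy to diff a deployment's configuration against the schema defaults listed above.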

Sampling Parameters

| Field | Type | Default |
|---|---|---|
| end_id | Optional[int] | None |
| pad_id | Optional[int] | None |
| max_tokens | int | 32 |
| bad | Union[str, list[str], None] | None |
| bad_token_ids | Optional[list[int]] | None |
| stop | Union[str, list[str], None] | None |
| stop_token_ids | Optional[list[int]] | None |
| include_stop_str_in_output | bool | false |
| embedding_bias | Optional[Tensor] | None |
| logits_processor | Union[LogitsProcessor, list[LogitsProcessor], None] | None |
| apply_batched_logits_processor | bool | false |
| n | int | 1 |
| best_of | Optional[int] | None |
| use_beam_search | bool | false |
| top_k | Optional[int] | None |
| top_p | Optional[float] | None |
| top_p_min | Optional[float] | None |
| top_p_reset_ids | Optional[int] | None |
| top_p_decay | Optional[float] | None |
| seed | Optional[int] | None |
| temperature | Optional[float] | None |
| min_tokens | Optional[int] | None |
| beam_search_diversity_rate | Optional[float] | None |
| repetition_penalty | Optional[float] | None |
| presence_penalty | Optional[float] | None |
| frequency_penalty | Optional[float] | None |
| length_penalty | Optional[float] | None |
| early_stopping | Optional[int] | None |
| no_repeat_ngram_size | Optional[int] | None |
| min_p | Optional[float] | None |
| beam_width_array | Optional[list[int]] | None |
| logprobs | Optional[int] | None |
| prompt_logprobs | Optional[int] | None |
| return_context_logits | bool | false |
| return_generation_logits | bool | false |
| exclude_input_from_output | bool | true |
| return_encoder_output | bool | false |
| return_perf_metrics | bool | false |
| additional_model_outputs | Optional[list[AdditionalModelOutput]] | None |
| lookahead_config | Optional[LookaheadDecodingConfig] | None |
| guided_decoding | Optional[GuidedDecodingParams] | None |
| ignore_eos | bool | false |
| detokenize | bool | true |
| add_special_tokens | bool | true |
| truncate_prompt_tokens | Optional[int] | None |
| skip_special_tokens | bool | true |
| spaces_between_special_tokens | bool | true |
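An illustrative keyword set for the sampling side. The names come from the table above; the values are arbitrary examples chosen for the sketch, not recommendations.

```python
# Sampling-parameter overrides as a plain kwargs dict. Keys are taken
# from the sampling-parameters table; values are example choices.
sampling_kwargs = {
    "max_tokens": 128,     # schema default: 32
    "temperature": 0.7,    # schema default: None
    "top_p": 0.95,         # schema default: None
    "stop": ["###"],       # accepts str or list[str]; schema default: None
    "seed": 1234,          # schema default: None
}

# With TensorRT-LLM installed, this would construct the dataclass, e.g.:
#   from tensorrt_llm import SamplingParams
#   params = SamplingParams(**sampling_kwargs)
print(sorted(sampling_kwargs))
```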