
TensorRT-LLM Engine Schema

Engine version: 0.21.0
Discovered at: 2026-05-06T20:20:57+02:00
Discovery method: TrtLlmArgs.model_json_schema() + dataclasses.fields(SamplingParams)
Schema version: 1.0.0

Summary: 60 engine parameters, 47 sampling parameters.

Discovery limitations

  • engine_params — BuildConfig is not a Pydantic model, so it appears as Optional[object] in the schema. Affected field: build_config.
  • sampling_params — SamplingParams is a dataclass, so no per-field descriptions are available.
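The dataclass limitation above can be reproduced with a minimal sketch: `dataclasses.fields()` exposes names, types, and defaults, but carries no per-field description metadata. The stand-in class below is hypothetical (the real `SamplingParams` lives in `tensorrt_llm.sampling_params` and has 47 fields).

```python
from dataclasses import dataclass, fields
from typing import Optional

# Hypothetical stand-in mirroring three SamplingParams fields;
# defaults match the table below.
@dataclass
class SamplingParamsStandIn:
    max_tokens: int = 32
    temperature: Optional[float] = None
    top_k: Optional[int] = None

# fields() yields name/type/default for each field -- but there is no
# description attribute, which is why the sampling table has no
# Description column.
rows = [(f.name, f.default) for f in fields(SamplingParamsStandIn)]
print(rows)
```

The same pattern, pointed at the real class, is what the discovery method listed at the top of this page does for the sampling half of the schema.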

Engine Parameters

| Field | Type | Default | Description |
|---|---|---|---|
| model | string | | The path to the model checkpoint or the model name from the Hugging Face Hub. |
| tokenizer | Optional[string] | None | |
| tokenizer_mode | Literal['auto', 'slow'] | auto | The mode to initialize the tokenizer. |
| skip_tokenizer_init | boolean | false | Whether to skip the tokenizer initialization. |
| trust_remote_code | boolean | false | Whether to trust remote code. |
| tensor_parallel_size | integer | 1 | The tensor parallel size. |
| dtype | string | auto | The data type to use for the model. |
| revision | Optional[string] | None | |
| tokenizer_revision | Optional[string] | None | |
| pipeline_parallel_size | integer | 1 | The pipeline parallel size. |
| context_parallel_size | integer | 1 | The context parallel size. |
| gpus_per_node | Optional[integer] | None | |
| moe_cluster_parallel_size | Optional[integer] | None | |
| moe_tensor_parallel_size | Optional[integer] | None | |
| moe_expert_parallel_size | Optional[integer] | None | |
| enable_attention_dp | boolean | false | Enable attention data parallelism. |
| cp_config | Optional[object] | None | |
| load_format | Literal['auto', 'dummy'] | auto | The format used to load the model. |
| enable_lora | boolean | false | Enable LoRA. |
| ⚠️ max_lora_rank | Optional[integer] | None | |
| ⚠️ max_loras | integer | 4 | The maximum number of LoRAs. |
| ⚠️ max_cpu_loras | integer | 4 | The maximum number of LoRAs on CPU. |
| lora_config | Optional[LoraConfig] | None | |
| enable_prompt_adapter | boolean | false | Enable prompt adapter. |
| max_prompt_adapter_token | integer | 0 | The maximum number of prompt adapter tokens. |
| quant_config | Optional[QuantConfig] | None | |
| kv_cache_config | KvCacheConfig | | KV cache config. |
| enable_chunked_prefill | boolean | false | Enable chunked prefill. |
| guided_decoding_backend | Optional[string] | None | |
| batched_logits_processor | Optional[tensorrt_llm.sampling_params.BatchedLogitsProcessor] | | Batched logits processor. |
| iter_stats_max_iterations | Optional[integer] | None | |
| request_stats_max_iterations | Optional[integer] | None | |
| peft_cache_config | Optional[PeftCacheConfig] | None | |
| scheduler_config | SchedulerConfig | | Scheduler config. |
| cache_transceiver_config | Optional[CacheTransceiverConfig] | None | |
| speculative_config | Union[LookaheadDecodingConfig, MedusaDecodingConfig, EagleDecodingConfig, …] | | |
| batching_type | Optional[BatchingType] | None | |
| normalize_log_probs | boolean | false | Normalize log probabilities. |
| max_batch_size | Optional[integer] | None | |
| max_input_len | Optional[integer] | None | |
| max_seq_len | Optional[integer] | None | |
| max_beam_width | Optional[integer] | None | |
| max_num_tokens | Optional[integer] | None | |
| gather_generation_logits | boolean | false | Gather generation logits. |
| num_postprocess_workers | integer | 0 | The number of processes used for postprocessing the generated tokens, including detokenization. |
| postprocess_tokenizer_dir | Optional[string] | None | |
| reasoning_parser | Optional[string] | None | |
| garbage_collection_gen0_threshold | integer | 20000 | Threshold for Python garbage collection of generation-0 objects. Lower values trigger more frequent garbage collection. |
| ⚠️ decoding_config | Optional[DecodingConfig] | | The decoding config. |
| backend | Optional[string] | None | |
| ⚠️ auto_parallel | boolean | false | Enable auto parallel mode. |
| ⚠️ auto_parallel_world_size | Optional[integer] | None | |
| enable_tqdm | boolean | false | Enable the tqdm progress bar. |
| workspace | Optional[string] | None | |
| enable_build_cache | Union[tensorrt_llm.llmapi.build_cache.BuildCacheConfig, bool] | false | Enable the build cache. |
| extended_runtime_perf_knob_config | Optional[ExtendedRuntimePerfKnobConfig] | None | |
| calib_config | Optional[CalibConfig] | None | |
| embedding_parallel_mode | string | SHARDING_ALONG_VOCAB | The embedding parallel mode. |
| fast_build | boolean | false | Enable fast build. |
| build_config | Optional[tensorrt_llm.builder.BuildConfig] | | Build config. |
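A minimal sketch of how a few of these engine parameters might be passed to the LLM constructor. Only the keyword names come from the schema; the model name and the overridden values below are illustrative placeholders, not recommendations.

```python
# Engine-parameter overrides as a plain kwargs dict. Keys are taken from
# the engine-parameters table; values are example choices.
engine_kwargs = {
    "model": "org/model-name",        # placeholder: checkpoint path or HF Hub name
    "tensor_parallel_size": 2,        # schema default: 1
    "dtype": "auto",                  # schema default: auto
    "trust_remote_code": False,       # schema default: false
    "enable_chunked_prefill": True,   # schema default: false
}

# With TensorRT-LLM installed, these would pass straight through, e.g.:
#   from tensorrt_llm import LLM
#   llm = LLM(**engine_kwargs)
print(sorted(engine_kwargs))
```

Keeping overrides in a dict like this makes it easy to diff a deployment's configuration against the schema defaults listed above.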

Sampling Parameters

| Field | Type | Default |
|---|---|---|
| end_id | Optional[int] | None |
| pad_id | Optional[int] | None |
| max_tokens | int | 32 |
| bad | Union[str, list[str], None] | None |
| bad_token_ids | Optional[list[int]] | None |
| stop | Union[str, list[str], None] | None |
| stop_token_ids | Optional[list[int]] | None |
| include_stop_str_in_output | bool | false |
| embedding_bias | Optional[Tensor] | None |
| logits_processor | Union[LogitsProcessor, list[LogitsProcessor], None] | None |
| apply_batched_logits_processor | bool | false |
| n | int | 1 |
| best_of | Optional[int] | None |
| use_beam_search | bool | false |
| top_k | Optional[int] | None |
| top_p | Optional[float] | None |
| top_p_min | Optional[float] | None |
| top_p_reset_ids | Optional[int] | None |
| top_p_decay | Optional[float] | None |
| seed | Optional[int] | None |
| temperature | Optional[float] | None |
| min_tokens | Optional[int] | None |
| beam_search_diversity_rate | Optional[float] | None |
| repetition_penalty | Optional[float] | None |
| presence_penalty | Optional[float] | None |
| frequency_penalty | Optional[float] | None |
| length_penalty | Optional[float] | None |
| early_stopping | Optional[int] | None |
| no_repeat_ngram_size | Optional[int] | None |
| min_p | Optional[float] | None |
| beam_width_array | Optional[list[int]] | None |
| logprobs | Optional[int] | None |
| prompt_logprobs | Optional[int] | None |
| return_context_logits | bool | false |
| return_generation_logits | bool | false |
| exclude_input_from_output | bool | true |
| return_encoder_output | bool | false |
| return_perf_metrics | bool | false |
| additional_model_outputs | Optional[list[AdditionalModelOutput]] | None |
| lookahead_config | Optional[LookaheadDecodingConfig] | None |
| guided_decoding | Optional[GuidedDecodingParams] | None |
| ignore_eos | bool | false |
| detokenize | bool | true |
| add_special_tokens | bool | true |
| truncate_prompt_tokens | Optional[int] | None |
| skip_special_tokens | bool | true |
| spaces_between_special_tokens | bool | true |
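An illustrative keyword set for the sampling side. The names come from the table above; the values are arbitrary examples chosen for the sketch, not recommendations.

```python
# Sampling-parameter overrides as a plain kwargs dict. Keys are taken
# from the sampling-parameters table; values are example choices.
sampling_kwargs = {
    "max_tokens": 128,     # schema default: 32
    "temperature": 0.7,    # schema default: None
    "top_p": 0.95,         # schema default: None
    "stop": ["###"],       # accepts str or list[str]; schema default: None
    "seed": 1234,          # schema default: None
}

# With TensorRT-LLM installed, this would construct the dataclass, e.g.:
#   from tensorrt_llm import SamplingParams
#   params = SamplingParams(**sampling_kwargs)
print(sorted(sampling_kwargs))
```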