Getting Started

Prerequisites: Complete Installation first.

This guide has three tracks. Choose one based on your setup:

  • Quick Start (Local Transformers) — The simplest track; get running in minutes with the Transformers backend.
  • Recommended Start (Docker + vLLM) — Full measurement experience with vLLM backend.
  • TensorRT-LLM Start (Docker) — Maximum performance with TensorRT engine compilation.

Track 1: Quick Start (Local Transformers)

Prerequisites

  • llenergymeasure installed (see Installation)
  • Docker + NVIDIA Container Toolkit installed — every engine, including Transformers, runs inside a per-engine Docker image (see development.md)
  • NVIDIA GPU available

1. Verify your environment

llem config

Check that the output shows your GPU detected and an energy sampler selected. Engines will show as "not installed" on host — that is expected; they run inside Docker. See Installation and development.md for the Docker build/run pattern.
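If no GPU shows up, nvidia-smi is a quick, independent way to confirm the driver can see the device before digging into llem itself (this assumes the NVIDIA driver is installed on the host):

nvidia-smi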

2. Run your first experiment

llem run --model gpt2 -e pytorch

This runs GPT-2 (124M parameters). On first run, the model downloads from HuggingFace (~500 MB). Subsequent runs use the cache.

Default settings: 100 prompts, aienergyscore dataset, bfloat16 dtype.
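For a quicker first pass, you can lower the prompt count with the -n flag used later in this guide (a sketch; 10 prompts is enough to confirm the pipeline works, but too few for stable energy numbers):

llem run --model gpt2 -e pytorch -n 10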

You will see a progress indicator on stderr, then results printed to stdout:

Result: gpt2-pytorch-bf16-20240305-143022 # ← unique experiment ID

Energy # ← GPU energy consumed
Total 847 J # ← total joules for all 100 prompts
Baseline 12.3 W # ← idle GPU power (subtracted from total)
Adjusted 723 J # ← energy minus baseline × duration

Performance # ← throughput and compute
Throughput 312 tok/s # ← output tokens per second (all 100 prompts)
FLOPs 4.21e+11 (roofline, medium) # ← estimated FLOPs (method, confidence)

Timing # ← wall-clock time
Duration 1m 38s # ← total experiment wall time
Warmup 5 prompts excluded # ← thermal stabilisation prompts (not in metrics)

Reading the results

Field                 What it measures
Total (J)             Raw GPU energy consumed during the experiment
Baseline (W)          Idle GPU power measured before the run
Adjusted (J)          Energy minus Baseline × Duration — net inference energy
Throughput (tok/s)    Output tokens generated per second across all prompts
FLOPs                 Estimated floating-point operations (method and confidence shown)
Duration              Wall-clock time for the full experiment
Warmup                Number of prompts run for thermal stabilisation and excluded from metrics

3. Output files

Results are saved to results/ in the current directory by default:

results/
└── gpt2-pytorch-bf16-20240305-143022/
    └── result.json    # full result record (all metrics, config, metadata)

The JSON file is the scientific record — it contains all raw metrics, the resolved config, timestamps, and measurement warnings.
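To skim the record without extra tooling, you can pretty-print it with Python's built-in json.tool (adjust the experiment ID to match your run):

python -m json.tool results/gpt2-pytorch-bf16-20240305-143022/result.json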

Specify a different output directory with --output:

llem run --model gpt2 -e pytorch --output /data/experiments

Track 2: Recommended Start (Docker + vLLM)

This is the recommended track: the full measurement experience, with vLLM running inside Docker.

Prerequisites

  • llenergymeasure installed (host-side orchestrator)
  • Docker + NVIDIA Container Toolkit installed — see Docker Setup
  • vLLM Docker image built — see development.md

1. Create a config file

Create experiment.yaml:

model: gpt2
engine: vllm
n: 50
runners:
  vllm: docker

2. Run the experiment

llem run experiment.yaml

What happens:

  1. Pre-flight checks run: Docker CLI, NVIDIA Container Toolkit, GPU visibility inside container, CUDA/driver compatibility.
  2. The vLLM Docker image is pulled on first run (ghcr.io/henrycgbaker/llenergymeasure/vllm:v0.9.0).
  3. The container launches, runs the experiment, and streams results back.
  4. Results are printed to stdout and saved to results/.
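If you want to confirm the image was pulled, or check which tag is present locally, standard Docker commands work; for example:

docker images | grep llenergymeasure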

3. Read the results

The output format is the same as the Transformers track. The key difference is engine: vllm in the experiment ID and result file.


Track 3: TensorRT-LLM (Docker)

TensorRT-LLM compiles models into optimised TensorRT engines, then runs inference against those engines. The first run compiles the engine (which may take several minutes); subsequent runs with the same config load the cached engine and are much faster.

Prerequisites

  • Docker + NVIDIA Container Toolkit installed — see Docker Setup
  • llenergymeasure installed (host-side orchestrator)
  • TensorRT-LLM Docker image built — see development.md
  • NVIDIA GPU with SM >= 7.5 (Turing or newer; e.g. RTX 2000-series, A100, H100)

1. Create a config file

Create experiment.yaml:

model: meta-llama/Llama-2-7b-hf
engine: tensorrt
n: 50
runners:
  tensorrt: docker

For a quantized run with engine caching configured explicitly:

model: meta-llama/Llama-2-7b-hf
engine: tensorrt
n: 50
runners:
  tensorrt: docker
tensorrt:
  max_batch_size: 8
  dtype: bfloat16
  quant:
    quant_algo: W4A16_AWQ
  build_cache:
    max_cache_storage_gb: 100

2. Run the experiment

llem run experiment.yaml

What happens:

  1. Pre-flight checks run: Docker CLI, NVIDIA Container Toolkit, GPU visibility, SM version check.
  2. The TensorRT-LLM Docker image is pulled on first run (ghcr.io/henrycgbaker/llenergymeasure/tensorrt:v0.9.0).
  3. The container compiles the TensorRT engine from the model weights. First run only — this takes several minutes. Progress is shown in the terminal.
  4. The compiled engine is cached on disk (~/.cache/tensorrt_llm inside the container, mounted from the host).
  5. Inference runs against the compiled engine.
  6. Results are printed to stdout and saved to results/.

Engine caching. The compiled engine is keyed to your config (model, dtype, max_batch_size, tp_size, etc.). Running the same experiment config again skips compilation and starts inference immediately. Changing any compile-time parameter triggers a new build.
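In practice this means a plain re-run is cheap, while edits to the tensorrt block are not (illustrative; the exact set of cache-key parameters is defined by the tool):

llem run experiment.yaml    # first run: compiles the engine, then measures
llem run experiment.yaml    # unchanged config: loads the cached engine, no rebuild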

3. Read the results

The output format is the same as other engines. The result file will include engine: tensorrt and a build_metadata section with engine compilation time, GPU architecture, and TRT-LLM version.


Using a Config File

For repeatability, store your experiment configuration in a YAML file.

Minimal config:

# experiment.yaml
model: gpt2
engine: transformers
n: 100

Run it:

llem run experiment.yaml

This is equivalent to llem run --model gpt2 -e pytorch -n 100. CLI flags override YAML values when both are provided.
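For example, keeping the YAML above but passing -n on the command line overrides the prompt count for that run only (assuming the flag is given alongside the config path, as the precedence rule implies):

llem run experiment.yaml -n 10    # YAML says n: 100; the flag wins for this run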

For study sweeps (running multiple configurations), see the Study Configuration reference.


Next Steps