The Impact of Quantization on vLLM Inference Performance
Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).
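As a concrete illustration of that mapping, the following minimal sketch (not part of the benchmarked setup) quantizes a float32 tensor to int8 with a symmetric per-tensor scale and measures the round-trip error introduced by the reduced precision:

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor int8 quantization: map [-absmax, absmax] onto [-127, 127]."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)            # stand-in for a float32 weight matrix
q, scale = quantize_int8(w)            # int8 storage is 4x smaller than float32
w_hat = dequantize_int8(q, scale)
print(f"mean absolute round-trip error: {(w - w_hat).abs().mean():.6f}")
```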
This document evaluates commonly used quantization techniques in vLLM and how they affect inference performance, especially throughput in high-concurrency scenarios.
Conclusions
- Weight-activation quantization (e.g., FP8) provides significant performance improvements with minimal quality loss and is recommended for most scenarios. Static quantization delivers higher throughput than dynamic, with accuracy dependent on calibration data:
  - Dynamic FP8: +14.8% TPS, -14.1% TTFT, -21.8% TPOT
  - Static FP8: +23.7% TPS, -20.4% TTFT, -32.4% TPOT
- Most quantization approaches can substantially improve throughput in VRAM-constrained scenarios, with up to +46% improvement observed in experiments.
- Weight-only quantization causes performance degradation in TPS, TTFT, and TPOT due to dequantization overhead when VRAM is not constrained.
- Among weight-only methods, AWQ and GPTQ deliver the best inference performance, while bitsandbytes and GGUF show poor performance or compatibility issues and are not recommended.
- Default vLLM kernels for AWQ and GPTQ performed on par with or better than the alternative Marlin kernels in these experiments.
- KV-Cache quantization provides relatively modest throughput improvements compared to other optimization techniques.
Technical Background
Quantization Types
| Quantization Type | Description |
|---|---|
| Weight-only | Only the weights are quantized after training; activations remain full-precision |
| Dynamic | Weights are pre-quantized; activations are quantized on-the-fly during inference |
| Static | Weights and activations are quantized ahead of time after calibration with a representative dataset |
| Quantization-aware Training | Simulates quantization during training so the model adapts to reduced precision |
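To illustrate where the dequantization overhead mentioned in the conclusions comes from, here is a toy sketch of a weight-only int8 linear layer; it is only an illustration of the idea, not how vLLM's kernels are implemented:

```python
import torch

def quantize_weight_int8(w: torch.Tensor):
    """Per-output-channel symmetric int8 quantization of a weight matrix."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0      # one scale per output row
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def weight_only_linear(x: torch.Tensor, q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Activations stay in full precision; the weight is dequantized at run time,
    # which is the per-forward-pass "dequantization overhead".
    w = q.to(x.dtype) * scale.to(x.dtype)
    return x @ w.t()

w = torch.randn(128, 64)        # [out_features, in_features] weight in float32
q, scale = quantize_weight_int8(w)
x = torch.randn(8, 64)          # activations (bf16/fp16 in a real serving stack)
y = weight_only_linear(x, q, scale)
print(y.shape)                  # torch.Size([8, 128])
```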
Calibration
Calibration is the step during quantization where the float32 ranges are computed. For weights this is straightforward, since the actual range is known at quantization time. For activations it is less clear, and different approaches exist:
- Post-training dynamic quantization: The range for each activation is computed on the fly at runtime. This gives good results with little extra work, but it can be slower than static quantization because of the overhead of computing the range on every forward pass, and it is not supported on all hardware.
- Post-training static quantization: The range for each activation is computed in advance at quantization time, typically by passing representative data through the model and recording the activation values (a minimal sketch follows this list). In practice, the steps are:
  - Observers are placed on activations to record their values.
  - A number of forward passes are run on a calibration dataset (around 200 examples is typically enough).
  - The range for each activation is derived from the recorded values according to a calibration technique.
- Quantization-aware training: The range for each activation is computed at training time, following the same idea as post-training static quantization, but "fake quantize" operators are used instead of observers: they record values just as observers do, and they also simulate the error induced by quantization so the model can adapt to it.
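To make the observer-based flow concrete, here is a minimal PyTorch sketch (a toy two-layer model stands in for the real network, and random tensors stand in for the calibration dataset; this is an illustration, not the procedure of any particular quantization library). Forward hooks act as observers, record per-activation absolute maxima over the calibration passes, and are then turned into static int8 scales:

```python
import torch
import torch.nn as nn

# Toy model standing in for the real network.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64)).eval()

# Step 1: put "observers" (forward hooks) on the Linear layers to record activation ranges.
absmax = {}

def make_observer(name):
    def hook(module, inputs, output):
        value = output.detach().abs().max().item()
        absmax[name] = max(absmax.get(name, 0.0), value)
    return hook

handles = [module.register_forward_hook(make_observer(name))
           for name, module in model.named_modules() if isinstance(module, nn.Linear)]

# Step 2: run forward passes on a calibration dataset (random data here; ~200 real samples in practice).
calibration_batches = [torch.randn(8, 64) for _ in range(32)]
with torch.no_grad():
    for batch in calibration_batches:
        model(batch)

for handle in handles:
    handle.remove()

# Step 3: turn the recorded ranges into static int8 scales (simple max calibration).
scales = {name: value / 127.0 for name, value in absmax.items()}
print(scales)
```

In a real pipeline the recorded ranges would be refined by a calibration technique (e.g., min-max or percentile clipping) and baked into the quantized checkpoint.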
KV Cache Quantization
KV cache quantization reduces memory footprint during inference by storing key-value cache in lower precision, allowing more tokens to be cached and improving throughput. vLLM supports FP8 datatypes (E4M3 and E5M2) but not INT8 KV cache. While research has explored 4-bit or 2-bit KV cache quantization, these approaches typically cause noticeable accuracy degradation such as reduced MMLU scores.
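In vLLM, FP8 KV-cache quantization is enabled through the `kv_cache_dtype` engine argument (`--kv-cache-dtype` when serving). Below is a minimal offline-inference sketch, assuming Qwen/Qwen3-8B fits on the local GPU; note that the E4M3 variant typically relies on per-tensor scaling factors, and vLLM falls back to a default scale when the checkpoint does not provide them:

```python
from vllm import LLM, SamplingParams

# Keep model weights in BF16 but store the KV cache in FP8 (E4M3).
# Use "fp8_e5m2" for the E5M2 variant.
llm = LLM(
    model="Qwen/Qwen3-8B",
    kv_cache_dtype="fp8_e4m3",
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain KV cache quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```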
Quantization Kernels
vLLM offers multiple quantization kernel implementations for quantization methods like AWQ and GPTQ. For AWQ quantization, the official AWQ kernel serves as the default, while GPTQ models utilize the ExLlamaV2 kernel by default. Additional optimized kernels such as Marlin and Machete are also available, offering enhanced performance particularly for larger batch sizes.
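The kernel that is used can be steered with the `quantization` engine argument (`--quantization` when serving); when it is left unset, vLLM infers the method from the checkpoint's quantization config and may prefer a Marlin kernel where supported. A sketch of forcing a specific kernel, where the AWQ checkpoint name is illustrative:

```python
from vllm import LLM

# Force the default AWQ kernel rather than letting vLLM auto-select a Marlin variant.
llm = LLM(
    model="Qwen/Qwen3-8B-AWQ",   # illustrative AWQ-quantized checkpoint name
    quantization="awq",
)

# To request the Marlin kernel for the same checkpoint instead:
# llm = LLM(model="Qwen/Qwen3-8B-AWQ", quantization="awq_marlin")
```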
Inference Quality Degradation
Quantization can lead to degradation in inference quality. While this documentation primarily focuses on performance metrics, we provide the following references for assessing quality impact:
Qwen3-8B model benchmarks on various quantization methods:
| Quantization | CEVAL | MMLU | GSM8K | HUMANEVAL |
|---|---|---|---|---|
| BF16 | 79.27 | 74.78 | 87.79 | 63.41 |
| FP8-Static | 78.23 | 74.79 | 86.96 | 62.20 |
| FP8-Dynamic | 78.45 | 74.75 | 87.64 | 62.80 |
| INT8-Dynamic | 78.01 | 74.84 | 86.96 | 67.07 |
| INT4-GPTQ | 77.19 | 73.26 | 86.43 | 62.20 |
| INT4-AWQ | 76.15 | 73.59 | 86.96 | 63.41 |
For more details, please refer to the AngelSlim benchmarks.
Experimental Setup
- Model: Qwen3-8B
- Hardware: NVIDIA RTX 4090 24GB / A100 80GB / H100 80GB
- vLLM Version: v0.9.2
- Dataset: ShareGPT
- Benchmark Script:
```bash
# Prepare the ShareGPT dataset
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# Benchmark on ShareGPT dataset
vllm bench serve \
  --model Qwen/Qwen3-8B \
  --endpoint-type openai-chat \
  --endpoint /v1/chat/completions \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1000
```
Experimental Results
RTX 4090
| Quantization Method | Throughput (tok/s) | Mean TTFT (ms) | Mean TPOT (ms) |
|---|---|---|---|
| BF16 | 3869.3 | 15385.7 | 63.45 |
| AWQ | 5653.4 (+46.1%) | 7913.4 | 87.7 |
| AWQ Marlin | 5536.7 (+43.1%) | 8133.8 | 90.42 |
| GPTQ (Int8) | 4918.98 (+27.1%) | 8415.70 | 96.07 |
| GPTQ Marlin | 5025.82 (+29.9%) | 8143.93 | 93.05 |
A100 80GB
| Quantization Method | Throughput (tok/s) | Mean TTFT (ms) | Mean TPOT (ms) |
|---|---|---|---|
| BF16 | 10338.25 | 3412.85 | 200.02 |
| GPTQ-Marlin (Int4) | 8146.73 | 10336.24 | 261.81 |
| GPTQ (Int4) | 8129.27 | 10414.74 | 261.64 |
| AWQ | 9611.61 | 3950.64 | 249.06 |
| AWQ Marlin | 8066.03 | 10506.70 | 264.33 |
| GPTQ Marlin (Int8) | 7119.60 | 12359.22 | 309.37 |
| GPTQ (Int8) | 7100.46 | 12380.82 | 309.34 |
| bitsandbytes | 5916.34 | 9115.43 | 252.91 |
| GGUF (Q4_K_M) | N/A | N/A | N/A |
Note: GGUF could not be benchmarked; vLLM reports that a GGUF model with architecture qwen3 is not supported.
H100 80GB
| Quantization Method | Throughput (tok/s) | Mean TTFT (ms) | Mean TPOT (ms) |
|---|---|---|---|
| FP8 Static | 16452.52 (+23.7%) | 4116.87 (-20.4%) | 85.57 (-32.4%) |
| FP8 Dynamic | 15275.64 (+14.8%) | 4445.10 (-14.1%) | 98.94 (-21.8%) |
| Int4-W4A16 | 13605.46 | 5302.14 | 130.54 |
| BF16 | 13305.38 | 5172.78 | 126.55 |
| AWQ | 9756.06 | 8794.29 | 209.13 |
| GPTQ | N/A | N/A | N/A |
Note: The GPTQ implementation is broken for this configuration; see the corresponding vLLM issue.
KV Cache Quantization
| Configuration | Throughput (tok/s) | Mean TTFT (ms) | Mean TPOT (ms) |
|---|---|---|---|
| BF16 (Baseline) | 13305.38 | 5172.78 | 126.55 |
| BF16 + KV Cache FP8 (E4M3) | 13513.19 (+1.6%) | 5688.39 | 103.21 |