The Impact of Quantization on vLLM Inference Performance
Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).
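As a concrete illustration of that mapping, the following minimal sketch (not part of the benchmarked setup) quantizes a float32 tensor to int8 with a symmetric per-tensor scale and measures the round-trip error introduced by the reduced precision:

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor int8 quantization: map [-absmax, absmax] onto [-127, 127]."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)            # stand-in for a float32 weight matrix
q, scale = quantize_int8(w)            # int8 storage is 4x smaller than float32
w_hat = dequantize_int8(q, scale)
print(f"mean absolute round-trip error: {(w - w_hat).abs().mean():.6f}")
```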
This document evaluates commonly used quantization techniques in vLLM and how they affect inference performance, especially throughput in high-concurrency scenarios.
Conclusions
- Weight-activation quantization (e.g., FP8) provides significant performance improvements with minimal quality loss and is recommended for most scenarios. Static quantization delivers higher throughput than dynamic, with accuracy dependent on calibration data:
  - Dynamic FP8: +14.8% TPS, -14.1% TTFT, -21.8% TPOT
  - Static FP8: +23.7% TPS, -20.4% TTFT, -32.4% TPOT
- Most quantization approaches can substantially improve throughput in VRAM-constrained scenarios, with up to +46% improvement observed in experiments.
- Weight-only quantization causes performance degradation in TPS, TTFT, and TPOT due to dequantization overhead when VRAM is not constrained.
- Among weight-only methods, AWQ and GPTQ deliver the best inference performance, while bitsandbytes and GGUF show poor performance or compatibility issues and are not recommended.
- Default vLLM kernels for AWQ and GPTQ performed on par with or better than the alternative Marlin kernels in these experiments.
- KV-Cache quantization provides relatively modest throughput improvements compared to other optimization techniques.
Technical Background
Quantization Types
| Quantization Type | Description |
|---|---|
| Weight-only | Only the weights are quantized after training; activations remain full-precision |
| Dynamic | Weights are pre-quantized; activations are quantized on-the-fly during inference |
| Static | Weights and activations are quantized ahead of time after calibration with a representative dataset |
| Quantization-aware Training | Simulates quantization during training so the model adapts to reduced precision |
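To illustrate where the dequantization overhead mentioned in the conclusions comes from, here is a toy sketch of a weight-only int8 linear layer; it is only an illustration of the idea, not how vLLM's kernels are implemented:

```python
import torch

def quantize_weight_int8(w: torch.Tensor):
    """Per-output-channel symmetric int8 quantization of a weight matrix."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0      # one scale per output row
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def weight_only_linear(x: torch.Tensor, q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Activations stay in full precision; the weight is dequantized at run time,
    # which is the per-forward-pass "dequantization overhead".
    w = q.to(x.dtype) * scale.to(x.dtype)
    return x @ w.t()

w = torch.randn(128, 64)        # [out_features, in_features] weight in float32
q, scale = quantize_weight_int8(w)
x = torch.randn(8, 64)          # activations (bf16/fp16 in a real serving stack)
y = weight_only_linear(x, q, scale)
print(y.shape)                  # torch.Size([8, 128])
```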
Calibration
Calibration is the step during quantization where the float32 ranges are computed. For weights this is straightforward, since the actual range is known at quantization time. For activations it is less clear, and different approaches exist:
- Post-training dynamic quantization: The range for each activation is computed on the fly at runtime. This gives good results with little extra work, but it can be slower than static quantization because of the overhead of computing the range on every forward pass, and it is not supported on all hardware.
- Post-training static quantization: The range for each activation is computed in advance at quantization time, typically by passing representative data through the model and recording the activation values (a minimal sketch follows this list). In practice, the steps are:
  - Observers are placed on activations to record their values.
  - A number of forward passes are run on a calibration dataset (around 200 examples is typically enough).
  - The range for each activation is derived from the recorded values according to a calibration technique.
- Quantization-aware training: The range for each activation is computed at training time, following the same idea as post-training static quantization, but "fake quantize" operators are used instead of observers: they record values just as observers do, and they also simulate the error induced by quantization so the model can adapt to it.
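To make the observer-based flow concrete, here is a minimal PyTorch sketch (a toy two-layer model stands in for the real network, and random tensors stand in for the calibration dataset; this is an illustration, not the procedure of any particular quantization library). Forward hooks act as observers, record per-activation absolute maxima over the calibration passes, and are then turned into static int8 scales:

```python
import torch
import torch.nn as nn

# Toy model standing in for the real network.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64)).eval()

# Step 1: put "observers" (forward hooks) on the Linear layers to record activation ranges.
absmax = {}

def make_observer(name):
    def hook(module, inputs, output):
        value = output.detach().abs().max().item()
        absmax[name] = max(absmax.get(name, 0.0), value)
    return hook

handles = [module.register_forward_hook(make_observer(name))
           for name, module in model.named_modules() if isinstance(module, nn.Linear)]

# Step 2: run forward passes on a calibration dataset (random data here; ~200 real samples in practice).
calibration_batches = [torch.randn(8, 64) for _ in range(32)]
with torch.no_grad():
    for batch in calibration_batches:
        model(batch)

for handle in handles:
    handle.remove()

# Step 3: turn the recorded ranges into static int8 scales (simple max calibration).
scales = {name: value / 127.0 for name, value in absmax.items()}
print(scales)
```

In a real pipeline the recorded ranges would be refined by a calibration technique (e.g., min-max or percentile clipping) and baked into the quantized checkpoint.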
KV Cache Quantization
KV cache quantization reduces memory footprint during inference by storing key-value cache in lower precision, allowing more tokens to be cached and improving throughput. vLLM supports FP8 datatypes (E4M3 and E5M2) but not INT8 KV cache. While research has explored 4-bit or 2-bit KV cache quantization, these approaches typically cause noticeable accuracy degradation such as reduced MMLU scores.
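In vLLM, FP8 KV-cache quantization is enabled through the `kv_cache_dtype` engine argument (`--kv-cache-dtype` when serving). Below is a minimal offline-inference sketch, assuming Qwen/Qwen3-8B fits on the local GPU; note that the E4M3 variant typically relies on per-tensor scaling factors, and vLLM falls back to a default scale when the checkpoint does not provide them:

```python
from vllm import LLM, SamplingParams

# Keep model weights in BF16 but store the KV cache in FP8 (E4M3).
# Use "fp8_e5m2" for the E5M2 variant.
llm = LLM(
    model="Qwen/Qwen3-8B",
    kv_cache_dtype="fp8_e4m3",
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain KV cache quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```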
Quantization Kernels
vLLM offers multiple quantization kernel implementations for quantization methods like AWQ and GPTQ. For AWQ quantization, the official AWQ kernel serves as the default, while GPTQ models utilize the ExLlamaV2 kernel by default. Additional optimized kernels such as Marlin and Machete are also available, offering enhanced performance particularly for larger batch sizes.
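The kernel that is used can be steered with the `quantization` engine argument (`--quantization` when serving); when it is left unset, vLLM infers the method from the checkpoint's quantization config and may prefer a Marlin kernel where supported. A sketch of forcing a specific kernel, where the AWQ checkpoint name is illustrative:

```python
from vllm import LLM

# Force the default AWQ kernel rather than letting vLLM auto-select a Marlin variant.
llm = LLM(
    model="Qwen/Qwen3-8B-AWQ",   # illustrative AWQ-quantized checkpoint name
    quantization="awq",
)

# To request the Marlin kernel for the same checkpoint instead:
# llm = LLM(model="Qwen/Qwen3-8B-AWQ", quantization="awq_marlin")
```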
Inference Quality Degradation
Quantization can lead to degradation in inference quality. While this documentation primarily focuses on performance metrics, we provide the following references for assessing quality impact:
Qwen3-8B model benchmarks on various quantization methods:
| Quantization | CEVAL | MMLU | GSM8K | HUMANEVAL |
|---|---|---|---|---|
| BF16 | 79.27 | 74.78 | 87.79 | 63.41 |
| FP8-Static | 78.23 | 74.79 | 86.96 | 62.20 |
| FP8-Dynamic | 78.45 | 74.75 | 87.64 | 62.80 |
| INT8-Dynamic | 78.01 | 74.84 | 86.96 | 67.07 |
| INT4-GPTQ | 77.19 | 73.26 | 86.43 | 62.20 |
| INT4-AWQ | 76.15 | 73.59 | 86.96 | 63.41 |
For more details, please refer to the AngelSlim benchmarks.
Experimental Setup
- Model: Qwen3-8B
- Hardware: NVIDIA RTX 4090 24GB / A100 80GB / H100 80GB
- vLLM Version: v0.9.2
- Dataset: ShareGPT
- Benchmark Script:
```bash
# Prepare the ShareGPT dataset
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# Benchmark on ShareGPT dataset
vllm bench serve \
  --model Qwen/Qwen3-8B \
  --endpoint-type openai-chat \
  --endpoint /v1/chat/completions \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1000
```
Experimental Results
RTX 4090
| Quantization Method | Throughput (tok/s) | Mean TTFT (ms) | Mean TPOT (ms) |
|---|---|---|---|
| BF16 | 3869.3 | 15385.7 | 63.45 |
| AWQ | 5653.4 (+46.1%) | 7913.4 | 87.7 |
| AWQ Marlin | 5536.7 (+43.1%) | 8133.8 | 90.42 |
| GPTQ (Int8) | 4918.98 (+27.1%) | 8415.70 | 96.07 |
| GPTQ Marlin | 5025.82 (+29.9%) | 8143.93 | 93.05 |
A100 80GB
| Quantization Method | Throughput (tok/s) | Mean TTFT (ms) | Mean TPOT (ms) |
|---|---|---|---|
| BF16 | 10338.25 | 3412.85 | 200.02 |
| GPTQ-Marlin (Int4) | 8146.73 | 10336.24 | 261.81 |
| GPTQ (Int4) | 8129.27 | 10414.74 | 261.64 |
| AWQ | 9611.61 | 3950.64 | 249.06 |
| AWQ Marlin | 8066.03 | 10506.70 | 264.33 |
| GPTQ Marlin (Int8) | 7119.60 | 12359.22 | 309.37 |
| GPTQ (Int8) | 7100.46 | 12380.82 | 309.34 |
| bitsandbytes | 5916.34 | 9115.43 | 252.91 |
| GGUF (Q4_K_M) | N/A | N/A | N/A |
Note: GGUF could not be benchmarked; vLLM reports that a GGUF model with architecture qwen3 is not supported.
H100 80GB
| Quantization Method | Throughput (tok/s) | Mean TTFT (ms) | Mean TPOT (ms) |
|---|---|---|---|
| FP8 Static | 16452.52 (+23.7%) | 4116.87 (-20.4%) | 85.57 (-32.4%) |
| FP8 Dynamic | 15275.64 (+14.8%) | 4445.10 (-14.1%) | 98.94 (-21.8%) |
| Int4-W4A16 | 13605.46 | 5302.14 | 130.54 |
| BF16 | 13305.38 | 5172.78 | 126.55 |
| AWQ | 9756.06 | 8794.29 | 209.13 |
| GPTQ | N/A | N/A | N/A |
Note: The GPTQ implementation is broken for this configuration; see the corresponding vLLM issue.
KV Cache Quantization
| Configuration | Throughput (tok/s) | Mean TTFT (ms) | Mean TPOT (ms) |
|---|---|---|---|
| BF16 (Baseline) | 13305.38 | 5172.78 | 126.55 |
| BF16 + KV Cache FP8 (E4M3) | 13513.19 (+1.6%) | 5688.39 | 103.21 |