Optimizing GLM-4.6/GLM-4.5 Throughput on NVIDIA A100 GPUs
Conclusion
Recommended configuration for optimizing throughput of GLM-4.6 on A100 GPUs:
Serving Command
vllm serve zai-org/GLM-4.6-FP8 -tp 8
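Once the server is up, a quick request against the OpenAI-compatible chat endpoint confirms it is serving correctly (a minimal sketch; assumes the default host and port 8000):
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-4.6-FP8",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32
  }'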
Note
- Our benchmark tests do not cover all possible optimization combinations. For example, we select the inference engine that performs best under its default configuration as the starting point for further tuning. This pruning approach yields a local optimum, which may not be the global optimum.
- There are other optimization methods that depend on specific user scenarios, including max batch size, schedule configuration, extended KV cache, CUDA graph, etc. The conclusions in this document can serve as a starting point for more targeted optimizations.
- The tests are conducted on specific hardware and software setups. Advances in the inference engine may lead to new conclusions.
- Although quantization may affect accuracy, FP8 quantization typically achieves less than a 1% accuracy drop for most models; see the evaluation results for more details. It is therefore highly recommended to use FP8 quantization for high-throughput serving scenarios.
If anything is missing or needs updating to reflect new changes, please let us know.
Optimization Objective
Achieve high throughput under high-concurrency request scenarios.
Experimental Setup
Model
zai-org/GLM-4.6
Hardware
NVIDIA A100 GPUs
Engine Version
- vLLM: v0.11.0
- SGLang: v0.5.3
- TensorRT-LLM: v1.0.0
Benchmark Dataset
- ShareGPT
- Random dataset with varying sequence lengths:
  - Very long prompt: 32000 input tokens, 100 output tokens
  - Long prompt: 4000 input tokens, 200 output tokens
  - Medium prompt: 2000 input tokens, 100 output tokens
  - Short prompt: 128 input tokens, 4 output tokens
Benchmark Script
We use the vLLM bench CLI tool to benchmark model performance. The following commands are used to run the benchmarks:
# Prepare the ShareGPT dataset
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
# Benchmark on ShareGPT dataset
vllm bench serve --model zai-org/GLM-4.6 --backend openai-chat --endpoint /v1/chat/completions --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000
# Benchmark on random dataset (fixed seed for reproducibility)
vllm bench serve --model zai-org/GLM-4.6 --backend openai-chat --endpoint /v1/chat/completions --dataset-name random --random-input-len 4000 --random-output-len 200 --num-prompts 500 --seed 42
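The remaining random-dataset configurations map onto the same flags; the sketch below sweeps all four length settings (reusing the num-prompts and seed values above for comparability):
# Sweep the random-dataset configurations (very long / long / medium / short prompts)
for lens in "32000 100" "4000 200" "2000 100" "128 4"; do
  set -- $lens
  vllm bench serve --model zai-org/GLM-4.6 --backend openai-chat \
    --endpoint /v1/chat/completions --dataset-name random \
    --random-input-len "$1" --random-output-len "$2" \
    --num-prompts 500 --seed 42
done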
Experiment Results
The original BF16 model cannot be served on a single server with NVIDIA A100 GPUs due to memory limitations. Therefore, we focus on serving the FP8-quantized model.
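As a rough back-of-the-envelope check (assuming GLM-4.6's roughly 355B total parameters and an 8× A100 80 GB node): BF16 weights alone take about 355B × 2 bytes ≈ 710 GB, exceeding the 8 × 80 GB = 640 GB of aggregate GPU memory before any KV cache is allocated, whereas FP8 halves the weight footprint to roughly 355 GB.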
1. Choosing the Inference Engine
vLLM
Serving script
vllm serve zai-org/GLM-4.6-FP8 -tp 8
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 148.61
Total input tokens: 214465
Total generated tokens: 199904
Request throughput (req/s): 6.73
Output token throughput (tok/s): 1345.12
Peak output token throughput (tok/s): 2500.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 2788.21
---------------Time to First Token----------------
Mean TTFT (ms): 51022.80
Median TTFT (ms): 48308.20
P99 TTFT (ms): 106768.28
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 154.20
Median TPOT (ms): 144.84
P99 TPOT (ms): 318.65
---------------Inter-token Latency----------------
Mean ITL (ms): 134.46
Median ITL (ms): 117.84
P99 ITL (ms): 326.23
==================================================
SGLang
Serving script
python3 -m sglang.launch_server --model-path zai-org/GLM-4.6-FP8 --host 0.0.0.0 --port 8000 --tp-size 8
Benchmark result
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Result: vLLM (2788.21 tok/s) > SGLang (not supported) = TensorRT-LLM (not supported). The SGLang error arises because the A100's Ampere architecture (SM80) lacks native support for the fp8e4nv (E4M3) data type that this kernel path requires.
2. Parallelism in vLLM
TP+EP
Serving script
vllm serve zai-org/GLM-4.6-FP8 -tp 8 --enable-expert-parallel
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 153.11
Total input tokens: 214465
Total generated tokens: 199904
Request throughput (req/s): 6.53
Output token throughput (tok/s): 1305.62
Peak output token throughput (tok/s): 2311.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 2706.34
---------------Time to First Token----------------
Mean TTFT (ms): 51031.51
Median TTFT (ms): 48265.76
P99 TTFT (ms): 108400.10
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 158.11
Median TPOT (ms): 149.37
P99 TPOT (ms): 315.61
---------------Inter-token Latency----------------
Mean ITL (ms): 139.37
Median ITL (ms): 127.75
P99 ITL (ms): 321.92
==================================================
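Result: TP only (2788.21 tok/s) > TP+EP (2706.34 tok/s). Enabling expert parallelism does not improve throughput in this setup, so plain tensor parallelism is kept.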
3. Max Number of Batched Tokens in vLLM
Serving script
vllm serve zai-org/GLM-4.6-FP8 -tp 8 --max-num-batched-tokens 8192
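The three results below were obtained by restarting the server with --max-num-batched-tokens set to 4096, 8192, and 16384. A minimal sweep sketch, assuming the default port 8000 and an otherwise idle node (the readiness check and process handling are illustrative):
for tokens in 4096 8192 16384; do
  vllm serve zai-org/GLM-4.6-FP8 -tp 8 --max-num-batched-tokens "$tokens" &
  server_pid=$!
  # Wait for the OpenAI-compatible server to come up
  until curl -sf http://localhost:8000/v1/models > /dev/null; do sleep 10; done
  vllm bench serve --model zai-org/GLM-4.6 --backend openai-chat \
    --endpoint /v1/chat/completions --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000
  kill "$server_pid" && wait "$server_pid" 2>/dev/null
done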
Benchmark result
# --max-num-batched-tokens 4096
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 146.98
Total input tokens: 214465
Total generated tokens: 199904
Request throughput (req/s): 6.80
Output token throughput (tok/s): 1360.05
Peak output token throughput (tok/s): 2502.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 2819.17
---------------Time to First Token----------------
Mean TTFT (ms): 49578.97
Median TTFT (ms): 47246.86
P99 TTFT (ms): 104952.43
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 159.13
Median TPOT (ms): 144.06
P99 TPOT (ms): 547.87
---------------Inter-token Latency----------------
Mean ITL (ms): 134.40
Median ITL (ms): 117.57
P99 ITL (ms): 458.51
==================================================
# --max-num-batched-tokens 8192
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 146.69
Total input tokens: 214465
Total generated tokens: 199904
Request throughput (req/s): 6.82
Output token throughput (tok/s): 1362.78
Peak output token throughput (tok/s): 2385.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 2824.82
---------------Time to First Token----------------
Mean TTFT (ms): 49591.68
Median TTFT (ms): 47020.36
P99 TTFT (ms): 104829.14
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 160.28
Median TPOT (ms): 143.24
P99 TPOT (ms): 559.15
---------------Inter-token Latency----------------
Mean ITL (ms): 134.08
Median ITL (ms): 118.83
P99 ITL (ms): 320.13
==================================================
# --max-num-batched-tokens 16384
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 147.02
Total input tokens: 214465
Total generated tokens: 199904
Request throughput (req/s): 6.80
Output token throughput (tok/s): 1359.74
Peak output token throughput (tok/s): 2435.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 2818.53
---------------Time to First Token----------------
Mean TTFT (ms): 49736.96
Median TTFT (ms): 47034.75
P99 TTFT (ms): 105084.77
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 160.50
Median TPOT (ms): 143.84
P99 TPOT (ms): 582.59
---------------Inter-token Latency----------------
Mean ITL (ms): 134.22
Median ITL (ms): 118.12
P99 ITL (ms): 325.43
==================================================
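Result: --max-num-batched-tokens 8192 (2824.82 tok/s) > 4096 (2819.17 tok/s) > 16384 (2818.53 tok/s). The three settings differ by well under 1%, and all sit roughly 1.1-1.3% above the default configuration (2788.21 tok/s).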
Summary of Optimization Options
| Optimization Option | Throughput Improvement |
|---|---|
| Engine Selection (vLLM) | baseline |
| Parallelism (TP only; enabling EP reduced throughput by ~2.9%) | no gain |
| Max Number of Batched Tokens (8192) | +1.3% |