Our benchmark tests do not cover all possible optimization combinations. For example, we select the inference engine that performs best under its default configuration as the starting point for further tuning. This pruning approach yields a local optimum, which may not be the global optimum.
Other tuning knobs depend on the specific user scenario, including maximum batch size, scheduler configuration, extended KV cache, CUDA graphs, and more. The conclusions in this document can serve as a starting point for more targeted optimization.
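As a concrete illustration, several of these knobs surface as server flags in vLLM. This is a sketch only: the flag names come from vLLM's CLI, but the values are placeholders, not tuned settings.

# Illustrative knob-to-flag mapping for a vLLM server; values are placeholders.
#   --max-num-seqs           -> maximum batch size (concurrent sequences per step)
#   --max-num-batched-tokens -> scheduler token budget per step
#   --gpu-memory-utilization -> fraction of GPU memory reserved, mostly for KV cache
#   (adding --enforce-eager would disable CUDA graphs; omit it to keep them on)
vllm serve zai-org/GLM-4.6 \
  --tensor-parallel-size 8 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95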
The tests are conducted on a specific hardware and software setup. As the inference engines advance, the conclusions may change.
Although quantization can impact accuracy, FP8 quantization incurs less than a 1% accuracy drop for most models; see the evaluation results for details. We therefore highly recommend FP8 quantization for high-throughput serving scenarios.
If we have missed anything, or if newer releases change these results, please let us know.
Optimization Objective
Achieve high throughput under high-concurrency request scenarios.
Experimental Setup
Model
zai-org/GLM-4.6
Hardware
8 × NVIDIA A100 SXM 80GB GPUs on a single node.
Engine Version
vLLM: v0.11.0
SGLang: v0.5.3
TensorRT-LLM: v1.0.0
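For reproducibility, one way to pin these versions is via pip. The package names below are the commonly published ones; verify against each project's install docs, especially for TensorRT-LLM, which may require NVIDIA's package index or an NGC container.

pip install vllm==0.11.0
pip install "sglang[all]==0.5.3"
pip install tensorrt-llm==1.0.0 --extra-index-url https://pypi.nvidia.com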
Benchmark Dataset
ShareGPT
Random dataset with varying sequence lengths:
Very long prompt: 32000 input tokens, 100 output tokens
Long prompt: 4000 input tokens, 200 output tokens
Medium prompt: 2000 input tokens, 100 output tokens
Short prompt: 128 input tokens, 4 output tokens
Benchmark Script
We use the vLLM bench CLI tool to benchmark model performance. The following commands run the benchmarks:
# Prepare the ShareGPT dataset
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# Benchmark on ShareGPT dataset
vllm bench serve --model zai-org/GLM-4.6 --backend openai-chat --endpoint /v1/chat/completions \
  --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000

# Benchmark on random dataset (fixed seed for reproducibility)
vllm bench serve --model zai-org/GLM-4.6 --backend openai-chat --endpoint /v1/chat/completions \
  --dataset-name random --random-input-len 4000 --random-output-len 200 --num-prompts 500 --seed 42
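The random-dataset command above corresponds to the long-prompt (4000/200) scenario. The other scenarios differ only in the length flags; for example (prompt counts are carried over from the command above and are an assumption, not a recorded setting):

# Very long prompt: 32000 input tokens, 100 output tokens
vllm bench serve --model zai-org/GLM-4.6 --backend openai-chat --endpoint /v1/chat/completions \
  --dataset-name random --random-input-len 32000 --random-output-len 100 --num-prompts 500 --seed 42

# Medium prompt: 2000 input tokens, 100 output tokens
vllm bench serve --model zai-org/GLM-4.6 --backend openai-chat --endpoint /v1/chat/completions \
  --dataset-name random --random-input-len 2000 --random-output-len 100 --num-prompts 500 --seed 42

# Short prompt: 128 input tokens, 4 output tokens
vllm bench serve --model zai-org/GLM-4.6 --backend openai-chat --endpoint /v1/chat/completions \
  --dataset-name random --random-input-len 128 --random-output-len 4 --num-prompts 500 --seed 42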
Experiment Results
The original BF16 model cannot be served on a single node of NVIDIA A100 GPUs due to memory limitations, so we focus on serving the FP8-quantized model.
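For reference, a minimal FP8 serving launch with vLLM on this node might look like the following. The checkpoint name zai-org/GLM-4.6-FP8 is an assumption; substitute the pre-quantized checkpoint actually used.

# Serve a pre-quantized FP8 checkpoint, sharded across all 8 A100s on the node.
vllm serve zai-org/GLM-4.6-FP8 --tensor-parallel-size 8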