Optimizing DeepSeek-R1 Throughput on NVIDIA H200 GPUs
Conclusion
Recommended configuration for optimizing throughput of DeepSeek-R1 on a single H200 node:
Serving Command
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-0528 --host 0.0.0.0 --port 8000 --tp-size 8 --enable-dp-attention
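Once the server is up, a single request to the OpenAI-compatible chat endpoint is a quick smoke test before launching any benchmark. This is a minimal sketch; it assumes the port configured above (8000) and that the served model name defaults to the model path:
# Smoke test: assumes port 8000 from the command above and the default served model name (= model path)
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "deepseek-ai/DeepSeek-R1-0528", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'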
Comparison of benchmark results before and after optimization:
| Benchmark Case | Baseline (vLLM without any optimizations) | Optimized |
|---|---|---|
| ShareGPT | Total TPS: 5482.59<br>Mean TPOT (ms): 227.45 | Total TPS: 7143.30 (+30.3%)<br>Mean TPOT (ms): 304.71 |
| Short Prompt | Total TPS: 5931.25<br>Mean TPOT (ms): 351.83 | Total TPS: 20448.42 (+244.8%)<br>Mean TPOT (ms): 802.48 |
| Medium Prompt | Total TPS: 17690.89<br>Mean TPOT (ms): 237.02 | Total TPS: 19024.33 (+7.5%)<br>Mean TPOT (ms): 154.49 |
| Long Prompt | Total TPS: 15808.55<br>Mean TPOT (ms): 164.23 | Total TPS: 16096.80 (+1.8%)<br>Mean TPOT (ms): 107.91 |
| Very Long Prompt | Total TPS: 15514.57<br>Mean TPOT (ms): 365.79 | Total TPS: 16091.68 (+3.7%)<br>Mean TPOT (ms): 198.62 |
Note
- Our benchmark tests do not cover all possible optimization combinations. For example, we select the inference engine that performs best under its default configuration as the starting point for further tuning. This pruning approach yields a local optimum, which may not be the global optimum.
- Other tuning knobs depend on the specific user scenario, such as maximum batch size, scheduler configuration, KV cache sizing, and CUDA graph settings. The conclusions in this document can serve as a starting point for more targeted optimizations; see the sketch after this note for the corresponding SGLang flags.
- The tests were conducted on a specific hardware and software setup. Advances in the inference engines may change these conclusions.
If anything is missing or has become outdated, please let us know.
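As a hedged illustration only (flag names taken from SGLang v0.5.x; verify with python3 -m sglang.launch_server --help), these are the knobs mentioned above applied to the recommended command. The values shown are placeholders, not tuned recommendations:
# Placeholder values for scenario-specific tuning; not benchmarked in this document
# --max-running-requests      : cap on concurrently scheduled requests (max batch size)
# --schedule-conservativeness : how conservatively the scheduler admits new requests
# --mem-fraction-static       : fraction of GPU memory reserved for weights and the KV cache pool
# --cuda-graph-max-bs         : largest batch size captured by CUDA graphs
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-0528 --tp-size 8 --enable-dp-attention --max-running-requests 256 --schedule-conservativeness 1.0 --mem-fraction-static 0.9 --cuda-graph-max-bs 256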
Optimization Objective
Achieve high throughput under high-concurrency request scenarios.
Experimental Setup
Model
deepseek-ai/DeepSeek-R1-0528
Hardware
Single node with 8× NVIDIA H200 GPUs
Engine Version
- vLLM: v0.11.0
- SGLang: v0.5.3
- TensorRT-LLM: v1.0.0
Benchmark Dataset
- ShareGPT
- Random dataset with varying sequence lengths:
  - Very long prompt: 32000 input tokens, 100 output tokens
  - Long prompt: 4000 input tokens, 200 output tokens
  - Medium prompt: 2000 input tokens, 100 output tokens
  - Short prompt: 128 input tokens, 4 output tokens
Benchmark Script
We use the vLLM bench CLI tool to benchmark model performance. The following commands prepare the ShareGPT dataset, run the ShareGPT case, and run one representative random case (4K input / 200 output); a sweep over all four random configurations is sketched after the commands:
# Prepare the ShareGPT dataset
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
# Benchmark on ShareGPT dataset
vllm bench serve --model deepseek-ai/DeepSeek-R1-0528 --backend openai-chat --endpoint /v1/chat/completions --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000
# Benchmark on random dataset (fixed seed for reproducibility)
vllm bench serve --model deepseek-ai/DeepSeek-R1-0528 --backend openai-chat --endpoint /v1/chat/completions --dataset-name random --random-input-len 4000 --random-output-len 200 --num-prompts 500 --seed 42
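The remaining random-prompt configurations differ only in input/output lengths and prompt counts. A sketch of the full sweep, with prompt counts matching the result blocks later in this document (the loop itself is our addition; the flags are the same as above):
# Sweep all four random configurations: "<input_len> <output_len> <num_prompts>"
for cfg in "32000 100 100" "4000 200 500" "2000 100 500" "128 4 1000"; do
  set -- $cfg
  vllm bench serve --model deepseek-ai/DeepSeek-R1-0528 --backend openai-chat --endpoint /v1/chat/completions --dataset-name random --random-input-len $1 --random-output-len $2 --num-prompts $3 --seed 42
done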
Experiment Results
1. Choosing the Inference Engine
vLLM
Serving script
vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 76.59
Total input tokens: 219171
Total generated tokens: 200752
Request throughput (req/s): 13.06
Output token throughput (tok/s): 2621.05
Peak output token throughput (tok/s): 7451.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 5482.59
---------------Time to First Token----------------
Mean TTFT (ms): 17743.08
Median TTFT (ms): 17698.39
P99 TTFT (ms): 24371.87
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 227.45
Median TPOT (ms): 177.90
P99 TPOT (ms): 568.18
---------------Inter-token Latency----------------
Mean ITL (ms): 119.29
Median ITL (ms): 66.09
P99 ITL (ms): 384.38
==================================================
SGLang
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-0528 --host 0.0.0.0 --port 8000 --tp-size 8
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 63.28
Total input tokens: 219171
Total generated tokens: 200767
Request throughput (req/s): 15.80
Output token throughput (tok/s): 3172.46
Peak output token throughput (tok/s): 9699.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 6635.73
---------------Time to First Token----------------
Mean TTFT (ms): 6948.08
Median TTFT (ms): 6793.80
P99 TTFT (ms): 13674.05
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 420.18
Median TPOT (ms): 216.55
P99 TPOT (ms): 2473.09
---------------Inter-token Latency----------------
Mean ITL (ms): 138.99
Median ITL (ms): 51.66
P99 ITL (ms): 1159.64
==================================================
TensorRT-LLM
Serving script
trtllm-serve /workspace/DeepSeek-R1-0528 --tp_size 8 --ep_size 8
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 124.04
Total input tokens: 219171
Total generated tokens: 200813
Request throughput (req/s): 8.06
Output token throughput (tok/s): 1618.94
Peak output token throughput (tok/s): 7636.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 3385.88
---------------Time to First Token----------------
Mean TTFT (ms): 20253.90
Median TTFT (ms): 20524.78
P99 TTFT (ms): 23743.46
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 221.31
Median TPOT (ms): 167.15
P99 TPOT (ms): 726.85
---------------Inter-token Latency----------------
Mean ITL (ms): 155.18
Median ITL (ms): 111.57
P99 ITL (ms): 1518.10
==================================================
Result: SGLang (6635.73 tok/s) > vLLM (5482.59 tok/s) > TensorRT-LLM (3385.88 tok/s). SGLang performs best under its default configuration, so we use it as the base for the following tuning steps.
2. Parallelism in SGLang
TP+DP Attention
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-0528 --host 0.0.0.0 --port 8000 --tp-size 8 --enable-dp-attention
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 58.79
Total input tokens: 219171
Total generated tokens: 200752
Request throughput (req/s): 17.01
Output token throughput (tok/s): 3414.99
Peak output token throughput (tok/s): 9601.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 7143.30
---------------Time to First Token----------------
Mean TTFT (ms): 5992.00
Median TTFT (ms): 5976.11
P99 TTFT (ms): 9650.89
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 304.71
Median TPOT (ms): 188.40
P99 TPOT (ms): 1704.65
---------------Inter-token Latency----------------
Mean ITL (ms): 121.64
Median ITL (ms): 51.15
P99 ITL (ms): 464.61
==================================================
3. Torch.compile in SGLang
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-0528 --host 0.0.0.0 --port 8000 --tp-size 8 --enable-dp-attention --enable-torch-compile
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 61.72
Total input tokens: 219171
Total generated tokens: 200813
Request throughput (req/s): 16.20
Output token throughput (tok/s): 3253.37
Peak output token throughput (tok/s): 9287.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 6804.17
---------------Time to First Token----------------
Mean TTFT (ms): 6978.36
Median TTFT (ms): 6965.90
P99 TTFT (ms): 10888.39
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 306.37
Median TPOT (ms): 203.10
P99 TPOT (ms): 1807.98
---------------Inter-token Latency----------------
Mean ITL (ms): 127.42
Median ITL (ms): 51.90
P99 ITL (ms): 596.33
==================================================
4. MTP (Multi-Token Prediction) in SGLang
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-0528 --host 0.0.0.0 --port 8000 --tp-size 8 --enable-dp-attention --enable-torch-compile --speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 189.43
Total input tokens: 219171
Total generated tokens: 200813
Request throughput (req/s): 5.28
Output token throughput (tok/s): 1060.10
Peak output token throughput (tok/s): 1121.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 2217.12
---------------Time to First Token----------------
Mean TTFT (ms): 95444.76
Median TTFT (ms): 98104.26
P99 TTFT (ms): 176418.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 42.24
Median TPOT (ms): 42.11
P99 TPOT (ms): 82.73
---------------Inter-token Latency----------------
Mean ITL (ms): 74.27
Median ITL (ms): 36.60
P99 ITL (ms): 200.54
==================================================
Summary of Optimization Options
Each improvement is the relative change in Total Token throughput on the ShareGPT workload versus the previous best configuration, i.e. (optimized − previous) / previous.
| Optimization Option | Throughput Improvement |
|---|---|
| Engine Selection (vLLM → SGLang) | +21.0% |
| Parallelism (TP → TP + DP attention) | +7.6% |
| Torch Compile | no improvement |
| MTP | no improvement |
Neither torch.compile nor MTP improved total throughput over the TP + DP attention configuration at this concurrency level. MTP sharply lowers per-token latency (Mean TPOT ≈ 42 ms), but overall throughput drops, likely because the draft-and-verify overhead competes with compute that is already saturated by large batches. Both options are therefore left out of the recommended configuration.
Other Benchmark Cases
We further benchmarked both the vLLM baseline and the optimized SGLang configuration to evaluate how well the optimization generalizes across workloads. The random-prompt cases reuse the commands from the Benchmark Script section; the concurrency-limited ShareGPT case is sketched below.
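The "ShareGPT batch size 4" case caps the number of in-flight requests. With the vLLM bench CLI this maps to the --max-concurrency flag (our reading of the "Maximum request concurrency" line in the output below; verify with vllm bench serve --help):
# ShareGPT limited to 4 concurrent requests; 100 prompts to match the result block below
vllm bench serve --model deepseek-ai/DeepSeek-R1-0528 --backend openai-chat --endpoint /v1/chat/completions --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100 --max-concurrency 4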
Baseline serving script
vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8
Baseline benchmark results
# random 32K input
============ Serving Benchmark Result ============
Successful requests: 100
Benchmark duration (s): 206.90
Total input tokens: 3199900
Total generated tokens: 10000
Request throughput (req/s): 0.48
Output token throughput (tok/s): 48.33
Peak output token throughput (tok/s): 498.00
Peak concurrent requests: 100.00
Total Token throughput (tok/s): 15514.57
---------------Time to First Token----------------
Mean TTFT (ms): 101351.99
Median TTFT (ms): 101163.80
P99 TTFT (ms): 200267.68
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 365.79
Median TPOT (ms): 400.72
P99 TPOT (ms): 404.78
---------------Inter-token Latency----------------
Mean ITL (ms): 362.20
Median ITL (ms): 459.96
P99 ITL (ms): 614.75
==================================================
# random 4K input
============ Serving Benchmark Result ============
Successful requests: 500
Benchmark duration (s): 132.68
Total input tokens: 1997530
Total generated tokens: 100000
Request throughput (req/s): 3.77
Output token throughput (tok/s): 753.67
Peak output token throughput (tok/s): 2922.00
Peak concurrent requests: 500.00
Total Token throughput (tok/s): 15808.55
---------------Time to First Token----------------
Mean TTFT (ms): 63058.71
Median TTFT (ms): 61952.74
P99 TTFT (ms): 124112.96
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 164.23
Median TPOT (ms): 177.37
P99 TPOT (ms): 195.80
---------------Inter-token Latency----------------
Mean ITL (ms): 163.74
Median ITL (ms): 57.87
P99 ITL (ms): 383.79
==================================================
# random 2K input
============ Serving Benchmark Result ============
Successful requests: 500
Benchmark duration (s): 59.26
Total input tokens: 998393
Total generated tokens: 50000
Request throughput (req/s): 8.44
Output token throughput (tok/s): 843.71
Peak output token throughput (tok/s): 4036.00
Peak concurrent requests: 500.00
Total Token throughput (tok/s): 17690.89
---------------Time to First Token----------------
Mean TTFT (ms): 28911.90
Median TTFT (ms): 27559.21
P99 TTFT (ms): 53497.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 237.02
Median TPOT (ms): 276.56
P99 TPOT (ms): 322.76
---------------Inter-token Latency----------------
Mean ITL (ms): 235.10
Median ITL (ms): 357.74
P99 ITL (ms): 379.46
==================================================
# random 128 input
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 22.07
Total input tokens: 126882
Total generated tokens: 4000
Request throughput (req/s): 45.32
Output token throughput (tok/s): 181.27
Peak output token throughput (tok/s): 2082.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 5931.25
---------------Time to First Token----------------
Mean TTFT (ms): 19392.11
Median TTFT (ms): 18992.42
P99 TTFT (ms): 21440.63
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 351.83
Median TPOT (ms): 389.74
P99 TPOT (ms): 432.12
---------------Inter-token Latency----------------
Mean ITL (ms): 263.87
Median ITL (ms): 368.43
P99 ITL (ms): 470.22
==================================================
# ShareGPT batch size 4
============ Serving Benchmark Result ============
Successful requests: 100
Maximum request concurrency: 4
Benchmark duration (s): 108.42
Total input tokens: 22836
Total generated tokens: 21338
Request throughput (req/s): 0.92
Output token throughput (tok/s): 196.81
Peak output token throughput (tok/s): 216.00
Peak concurrent requests: 9.00
Total Token throughput (tok/s): 407.45
---------------Time to First Token----------------
Mean TTFT (ms): 123.30
Median TTFT (ms): 117.53
P99 TTFT (ms): 258.12
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 19.36
Median TPOT (ms): 19.13
P99 TPOT (ms): 22.67
---------------Inter-token Latency----------------
Mean ITL (ms): 19.41
Median ITL (ms): 18.84
P99 ITL (ms): 48.61
==================================================
Optimized serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-0528 --host 0.0.0.0 --port 8000 --tp-size 8 --enable-dp-attention
Optimized benchmark results
# random 32K input
============ Serving Benchmark Result ============
Successful requests: 100
Benchmark duration (s): 199.47
Total input tokens: 3199900
Total generated tokens: 9952
Request throughput (req/s): 0.50
Output token throughput (tok/s): 49.89
Peak output token throughput (tok/s): 521.00
Peak concurrent requests: 100.00
Total Token throughput (tok/s): 16091.68
---------------Time to First Token----------------
Mean TTFT (ms): 98133.46
Median TTFT (ms): 97539.39
P99 TTFT (ms): 196094.04
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 198.62
Median TPOT (ms): 198.45
P99 TPOT (ms): 365.31
---------------Inter-token Latency----------------
Mean ITL (ms): 209.45
Median ITL (ms): 35.27
P99 ITL (ms): 1680.63
==================================================
# random 4K input
============ Serving Benchmark Result ============
Successful requests: 500
Benchmark duration (s): 130.31
Total input tokens: 1997530
Total generated tokens: 100000
Request throughput (req/s): 3.84
Output token throughput (tok/s): 767.42
Peak output token throughput (tok/s): 3227.00
Peak concurrent requests: 500.00
Total Token throughput (tok/s): 16096.80
---------------Time to First Token----------------
Mean TTFT (ms): 62808.37
Median TTFT (ms): 61113.90
P99 TTFT (ms): 122763.87
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 107.91
Median TPOT (ms): 105.31
P99 TPOT (ms): 179.79
---------------Inter-token Latency----------------
Mean ITL (ms): 115.81
Median ITL (ms): 46.40
P99 ITL (ms): 155.40
==================================================
# random 2K input
============ Serving Benchmark Result ============
Successful requests: 500
Benchmark duration (s): 55.11
Total input tokens: 998393
Total generated tokens: 50000
Request throughput (req/s): 9.07
Output token throughput (tok/s): 907.31
Peak output token throughput (tok/s): 5778.00
Peak concurrent requests: 500.00
Total Token throughput (tok/s): 19024.33
---------------Time to First Token----------------
Mean TTFT (ms): 27248.86
Median TTFT (ms): 25136.93
P99 TTFT (ms): 49876.53
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 154.49
Median TPOT (ms): 148.01
P99 TPOT (ms): 294.80
---------------Inter-token Latency----------------
Mean ITL (ms): 159.34
Median ITL (ms): 48.78
P99 ITL (ms): 801.82
==================================================
# random 128 input
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 6.40
Total input tokens: 126882
Total generated tokens: 4000
Request throughput (req/s): 156.24
Output token throughput (tok/s): 624.94
Peak output token throughput (tok/s): 2769.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 20448.42
---------------Time to First Token----------------
Mean TTFT (ms): 3626.25
Median TTFT (ms): 3413.13
P99 TTFT (ms): 4879.78
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 802.48
Median TPOT (ms): 940.27
P99 TPOT (ms): 1182.88
---------------Inter-token Latency----------------
Mean ITL (ms): 481.68
Median ITL (ms): 139.56
P99 ITL (ms): 3031.24
==================================================
