Optimizing DeepSeek-V3.2 Throughput on NVIDIA H200 GPUs
Conclusion
Recommended configuration for optimizing DeepSeek-V3.2 throughput on a single node with 8×H200:
Serving Command
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8 --dp-size 8 --enable-dp-attention \
--reasoning-parser deepseek-v3 \
--tool-call-parser deepseekv32
Link for tool_chat_template_deepseekv32.jinja.
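Once the server is up, you can sanity-check that the reasoning and tool-call parsers are wired correctly by sending a tool-enabled request to the OpenAI-compatible endpoint. A minimal sketch, assuming the server listens on SGLang's default port 30000 (set --port to change it); the get_weather tool here is a hypothetical example:

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3.2",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
```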
Based on the benchmarks below, we recommend the above configuration for maximizing DeepSeek-V3.2 throughput on 8×H200.
Parallelism and Tool Call Configuration provide the largest performance gains and are therefore included in the recommended command. Context Length Adjustment can further improve throughput but is highly workload-dependent and should be tuned according to actual usage. While Attention Backend optimizations show positive effects, their gains are relatively small and may vary across datasets, so they are not included in the default recommendation.
Comparison of benchmark results before and after optimization:
| Benchmark Case | Baseline (vLLM without any optimizations) | Optimized |
|---|---|---|
| ShareGPT | Total TPS: 5713.95<br>Mean TPOT (ms): 275.03 | Total TPS: 8968.32 (+56.95%)<br>Mean TPOT (ms): 203.05 |
| Short Prompt | Total TPS: 10071.49<br>Mean TPOT (ms): 781.80 | Total TPS: 18227.38 (+80.98%)<br>Mean TPOT (ms): 776.55 |
| Medium Prompt | Total TPS: 10925.59<br>Mean TPOT (ms): 354.59 | Total TPS: 27712.54 (+153.65%)<br>Mean TPOT (ms): 192.24 |
| Long Prompt | Total TPS: 9974.26<br>Mean TPOT (ms): 226.74 | Total TPS: 20545.67 (+105.99%)<br>Mean TPOT (ms): 177.95 |
| Very Long Prompt | Total TPS: 9709.27<br>Mean TPOT (ms): 472.52 | Total TPS: 20045.18 (+106.45%)<br>Mean TPOT (ms): 246.26 |
| Generation-Heavy Prompt | Total TPS: 3112.52<br>Mean TPOT (ms): 45.72 | Total TPS: 3703.98 (+19.00%)<br>Mean TPOT (ms): 39.45 |
Note
- Our benchmark tests do not cover all possible optimization combinations. For example, we select the inference engine that performs best under its default configuration as the starting point for further tuning. This pruning approach yields a local optimum, which may not be the global optimum.
- There are other optimization methods that depend on the specific user scenario, including max batch size, scheduler configuration, extended KV cache, CUDA graphs, etc. The conclusions in this document can serve as a starting point for more targeted optimizations; an illustrative sketch of such knobs follows this list.
- The tests are conducted on specific hardware and software setups. Advances in the inference engine may lead to new conclusions.
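For illustration only, here is how a few of those scenario-dependent knobs are exposed in SGLang. The values below are placeholders to tune per workload, not recommendations:

```bash
# Scenario-dependent knobs (illustrative values, tune per workload):
#   --max-running-requests       cap on concurrently running requests (max batch size)
#   --schedule-conservativeness  how aggressively the scheduler admits new requests
#   --mem-fraction-static        fraction of GPU memory reserved for weights + KV cache
#   --cuda-graph-max-bs          largest decode batch size captured by CUDA graphs
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 \
  --tp-size 8 --dp-size 8 --enable-dp-attention \
  --max-running-requests 512 --schedule-conservativeness 1.0 \
  --mem-fraction-static 0.85 --cuda-graph-max-bs 256
```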
If there are any missing points or updates reflecting new changes, please let us know.
Optimization Objective
Achieve high throughput under high-concurrency request scenarios.
Experimental Setup
Model
deepseek-ai/DeepSeek-V3.2
Hardware
8 × NVIDIA H200 SXM GPUs on a single node.
Engine Version
- vLLM: v0.13.0
- SGLang: v0.5.6.post2
- TensorRT-LLM: 1.2.0rc5
Benchmark Dataset
- ShareGPT
- Random dataset with varying sequence lengths:
  - Very Long Prompt: 32000 input tokens, 100 output tokens
  - Long Prompt: 4000 input tokens, 200 output tokens
  - Medium Prompt: 2000 input tokens, 100 output tokens
  - Short Prompt: 128 input tokens, 4 output tokens
  - Generation-Heavy Prompt: 1000 input tokens, 2000 output tokens
Benchmark Script
We use the vLLM bench CLI tool to benchmark model performance. The following commands run the benchmarks:
# Prepare the ShareGPT dataset
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
# Benchmark on ShareGPT dataset
vllm bench serve --model deepseek-ai/DeepSeek-V3.2 --backend openai-chat --endpoint /v1/chat/completions --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000
# Benchmark on random dataset (fixed seed for reproducibility)
vllm bench serve --model deepseek-ai/DeepSeek-V3.2 --backend openai-chat --endpoint /v1/chat/completions --dataset-name random --random-input-len 4000 --random-output-len 200 --num-prompts 500 --seed 42
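To reproduce the full set of random-dataset cases listed above, the same command can be swept over the (input length, output length) pairs. A small sketch, using the prompt counts implied by the result blocks later in this document:

```bash
# (input_len output_len num_prompts) per benchmark case:
# short, medium, long, very long, generation-heavy
for cfg in "128 4 1000" "2000 100 500" "4000 200 500" "32000 100 100" "1000 2000 100"; do
  set -- $cfg
  vllm bench serve --model deepseek-ai/DeepSeek-V3.2 \
    --backend openai-chat --endpoint /v1/chat/completions \
    --dataset-name random --random-input-len "$1" --random-output-len "$2" \
    --num-prompts "$3" --seed 42
done
```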
Experiment Results
1. Baseline of the Inference Engine
vLLM
Serving script
vllm serve deepseek-ai/DeepSeek-V3.2 -tp 8 --tokenizer-mode deepseek_v32 --reasoning-parser deepseek_v3
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 72.75
Total input tokens: 219111
Total generated tokens: 196599
Request throughput (req/s): 13.75
Output token throughput (tok/s): 2702.26
Peak output token throughput (tok/s): 6009.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 5713.95
---------------Time to First Token----------------
Mean TTFT (ms): 13054.63
Median TTFT (ms): 12849.39
P99 TTFT (ms): 22754.42
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 275.03
Median TPOT (ms): 171.63
P99 TPOT (ms): 666.78
---------------Inter-token Latency----------------
Mean ITL (ms): 131.14
Median ITL (ms): 81.97
P99 ITL (ms): 668.29
==================================================
SGLang
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 138.25
Total input tokens: 219111
Total generated tokens: 197337
Request throughput (req/s): 7.23
Output token throughput (tok/s): 1427.44
Peak output token throughput (tok/s): 7949.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 3012.37
---------------Time to First Token----------------
Mean TTFT (ms): 11470.15
Median TTFT (ms): 11240.35
P99 TTFT (ms): 22363.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 1103.96
Median TPOT (ms): 672.86
P99 TPOT (ms): 4231.55
---------------Inter-token Latency----------------
Mean ITL (ms): 389.06
Median ITL (ms): 64.46
P99 ITL (ms): 3013.62
==================================================
TensorRT-LLM
Serving script
trtllm-serve /workspace/DeepSeek-v3.2 --tp_size 8
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 984
Benchmark duration (s): 236.70
Total input tokens: 215176
Total generated tokens: 194893
Request throughput (req/s): 4.16
Output token throughput (tok/s): 823.39
Peak output token throughput (tok/s): 4212.00
Peak concurrent requests: 984.00
Total Token throughput (tok/s): 1732.48
---------------Time to First Token----------------
Mean TTFT (ms): 65710.22
Median TTFT (ms): 66076.17
P99 TTFT (ms): 125723.14
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 1391.76
Median TPOT (ms): 608.74
P99 TPOT (ms): 4785.14
---------------Inter-token Latency----------------
Mean ITL (ms): 459.69
Median ITL (ms): 141.56
P99 ITL (ms): 4910.46
==================================================
Result: vLLM (5713.95 tok/s) > SGLang (3012.37 tok/s) > TensorRT-LLM (1732.48 tok/s)
2. Optimizing vLLM
Parallelism: DP+EP
Serving script
# --max-model-len 81920 is half the model's maximum context; serving the full context OOMs on this setup
vllm serve deepseek-ai/DeepSeek-V3.2 --tokenizer-mode deepseek_v32 --reasoning-parser deepseek_v3 \
-tp 1 -dp 8 --enable-expert-parallel --max-model-len 81920
Benchmark result
Successful requests: 1000
Benchmark duration (s): 65.62
Total input tokens: 219111
Total generated tokens: 197109
Request throughput (req/s): 15.24
Output token throughput (tok/s): 3003.90
Peak output token throughput (tok/s): 10222.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 6343.10
---------------Time to First Token----------------
Mean TTFT (ms): 9144.24
Median TTFT (ms): 10233.68
P99 TTFT (ms): 14920.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 278.22
Median TPOT (ms): 115.72
P99 TPOT (ms): 2048.82
---------------Inter-token Latency----------------
Mean ITL (ms): 100.47
Median ITL (ms): 74.61
P99 ITL (ms): 1239.62
==================================================
Parallelism: DCP
Serving script
vllm serve deepseek-ai/DeepSeek-V3.2 -tp 8 --tokenizer-mode deepseek_v32 --reasoning-parser deepseek_v3 \
-dcp 8
DeepSeek V3.2 relies on the FlashMLA sparse attention backend, which currently does not expose softmax log-sum-exp (LSE) during the decode phase. Since Decode Context Parallelism (DCP) requires softmax LSE for correct cross-rank aggregation, DCP is not supported with FlashMLA at this time, leading to a runtime failure in vLLM. This limitation has been discussed in the vLLM repository (see issue #27544).
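To see why the LSE is needed, here is a minimal NumPy sketch (illustrative, not vLLM's implementation) of how context-parallel ranks merge partial attention results for a single query: each rank returns its shard-local output together with its log-sum-exp, and without that LSE the partial outputs cannot be renormalized into the exact global softmax.

```python
import numpy as np

def shard_attention(q, k, v):
    """Attention for one query over one rank's KV shard.
    Returns the shard-local output and the shard's log-sum-exp."""
    scores = k @ q                          # (n_shard,) raw attention logits
    lse = np.log(np.sum(np.exp(scores)))    # log-sum-exp over this shard only
    out = np.exp(scores - lse) @ v          # shard-local softmax-weighted sum
    return out, lse

def merge_shards(outs, lses):
    """Combine per-rank partial outputs into the exact global attention output."""
    lses = np.array(lses)
    global_lse = np.log(np.sum(np.exp(lses - lses.max()))) + lses.max()
    weights = np.exp(lses - global_lse)     # each shard's share of the softmax mass
    return sum(w * o for w, o in zip(weights, outs))

# Sanity check: merging four shards reproduces full attention exactly.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=8), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
full, _ = shard_attention(q, k, v)
parts = [shard_attention(q, k[i:i + 4], v[i:i + 4]) for i in range(0, 16, 4)]
merged = merge_shards([p[0] for p in parts], [p[1] for p in parts])
assert np.allclose(full, merged)
```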
MTP
Serving script
vllm serve deepseek-ai/DeepSeek-V3.2 -tp 8 --tokenizer-mode deepseek_v32 --reasoning-parser deepseek_v3 \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 75.13
Total input tokens: 219111
Total generated tokens: 197345
Request throughput (req/s): 13.31
Output token throughput (tok/s): 2626.63
Peak output token throughput (tok/s): 3940.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 5542.96
---------------Time to First Token----------------
Mean TTFT (ms): 11110.51
Median TTFT (ms): 10739.31
P99 TTFT (ms): 21886.92
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 291.48
Median TPOT (ms): 250.14
P99 TPOT (ms): 735.48
---------------Inter-token Latency----------------
Mean ITL (ms): 290.48
Median ITL (ms): 144.81
P99 ITL (ms): 4636.00
==================================================
Turn off DeepGEMM in vLLM
Serving script
export VLLM_USE_DEEP_GEMM=0
vllm serve deepseek-ai/DeepSeek-V3.2 -tp 8 --tokenizer-mode deepseek_v32 --reasoning-parser deepseek_v3
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 74.92
Total input tokens: 219111
Total generated tokens: 197222
Request throughput (req/s): 13.35
Output token throughput (tok/s): 2632.38
Peak output token throughput (tok/s): 6010.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 5556.92
---------------Time to First Token----------------
Mean TTFT (ms): 11250.21
Median TTFT (ms): 10762.25
P99 TTFT (ms): 20723.94
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 287.49
Median TPOT (ms): 188.99
P99 TPOT (ms): 667.67
---------------Inter-token Latency----------------
Mean ITL (ms): 140.98
Median ITL (ms): 83.88
P99 ITL (ms): 671.43
==================================================
Enable Tool Call Configuration
Serving script
vllm serve deepseek-ai/DeepSeek-V3.2 -tp 8 --tokenizer-mode deepseek_v32 --reasoning-parser deepseek_v3 --tool-call-parser deepseek_v32 --enable-auto-tool-choice
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 74.16
Total input tokens: 219111
Total generated tokens: 197012
Request throughput (req/s): 13.48
Output token throughput (tok/s): 2656.40
Peak output token throughput (tok/s): 6095.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 5610.78
---------------Time to First Token----------------
Mean TTFT (ms): 13648.83
Median TTFT (ms): 13557.38
P99 TTFT (ms): 23498.54
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 280.78
Median TPOT (ms): 176.29
P99 TPOT (ms): 677.36
---------------Inter-token Latency----------------
Mean ITL (ms): 132.99
Median ITL (ms): 82.69
P99 ITL (ms): 683.11
==================================================
Context Length Adjustment
Serving script
vllm serve deepseek-ai/DeepSeek-V3.2 -tp 8 --tokenizer-mode deepseek_v32 --reasoning-parser deepseek_v3 --max-model-len 32768
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 75.12
Total input tokens: 219111
Total generated tokens: 196938
Request throughput (req/s): 13.31
Output token throughput (tok/s): 2621.48
Peak output token throughput (tok/s): 6293.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 5538.11
---------------Time to First Token----------------
Mean TTFT (ms): 10439.14
Median TTFT (ms): 10296.60
P99 TTFT (ms): 23842.17
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 357.68
Median TPOT (ms): 223.35
P99 TPOT (ms): 1169.74
---------------Inter-token Latency----------------
Mean ITL (ms): 154.24
Median ITL (ms): 78.13
P99 ITL (ms): 660.95
==================================================
Attention Backend: CUTLASS_MLA
Serving script
export VLLM_ATTENTION_BACKEND="CUTLASS_MLA"
vllm serve deepseek-ai/DeepSeek-V3.2 -tp 8 --tokenizer-mode deepseek_v32 --reasoning-parser deepseek_v3
ValueError: Selected backend AttentionBackendEnum.CUTLASS_MLA is not valid for this configuration. Reason: ['sparse not supported', 'compute capability not supported']
Attention Backend: FLASHMLA
Serving script
export VLLM_ATTENTION_BACKEND="FLASHMLA"
vllm serve deepseek-ai/DeepSeek-V3.2 -tp 8 --tokenizer-mode deepseek_v32 --reasoning-parser deepseek_v3
ValueError: Selected backend AttentionBackendEnum.FLASHMLA is not valid for this configuration. Reason: ['sparse not supported']
Attention Backend: TRITON_MLA
Serving script
export VLLM_ATTENTION_BACKEND="TRITON_MLA"
vllm serve deepseek-ai/DeepSeek-V3.2 -tp 8 --tokenizer-mode deepseek_v32 --reasoning-parser deepseek_v3
ValueError: Selected backend AttentionBackendEnum.TRITON_MLA is not valid for this configuration. Reason: ['sparse not supported']
3. Optimizing SGLang
Parallelism: TP+DP Attention
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8 --enable-dp-attention
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 99.46
Total input tokens: 219111
Total generated tokens: 197633
Request throughput (req/s): 10.05
Output token throughput (tok/s): 1987.11
Peak output token throughput (tok/s): 8911.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 4190.17
---------------Time to First Token----------------
Mean TTFT (ms): 13679.11
Median TTFT (ms): 12663.74
P99 TTFT (ms): 21665.97
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 608.90
Median TPOT (ms): 398.19
P99 TPOT (ms): 3453.39
---------------Inter-token Latency----------------
Mean ITL (ms): 230.03
Median ITL (ms): 59.95
P99 ITL (ms): 1824.84
==================================================
Parallelism: TP+DP+DP Attention
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8 --dp-size 8 --enable-dp-attention
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 56.62
Total input tokens: 219111
Total generated tokens: 197116
Request throughput (req/s): 17.66
Output token throughput (tok/s): 3481.55
Peak output token throughput (tok/s): 11298.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 7351.59
---------------Time to First Token----------------
Mean TTFT (ms): 7815.00
Median TTFT (ms): 8024.81
P99 TTFT (ms): 12928.46
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 287.86
Median TPOT (ms): 107.58
P99 TPOT (ms): 2096.18
---------------Inter-token Latency----------------
Mean ITL (ms): 91.68
Median ITL (ms): 58.87
P99 ITL (ms): 317.43
==================================================
Parallelism: TP+DP+DP Attention+EP
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8 --dp-size 8 --enable-dp-attention --ep-size 8
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 57.14
Total input tokens: 219111
Total generated tokens: 197614
Request throughput (req/s): 17.50
Output token throughput (tok/s): 3458.68
Peak output token throughput (tok/s): 11757.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 7293.61
---------------Time to First Token----------------
Mean TTFT (ms): 8437.36
Median TTFT (ms): 8346.50
P99 TTFT (ms): 15410.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 319.46
Median TPOT (ms): 111.65
P99 TPOT (ms): 2443.91
---------------Inter-token Latency----------------
Mean ITL (ms): 94.48
Median ITL (ms): 56.54
P99 ITL (ms): 451.92
==================================================
MTP
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--tp-size 8 \
--speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 278.36
Total input tokens: 219111
Total generated tokens: 193349
Request throughput (req/s): 3.59
Output token throughput (tok/s): 694.61
Peak output token throughput (tok/s): 974.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 1481.77
---------------Time to First Token----------------
Mean TTFT (ms): 139046.94
Median TTFT (ms): 144733.94
P99 TTFT (ms): 260932.68
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 68.88
Median TPOT (ms): 63.31
P99 TPOT (ms): 280.02
---------------Inter-token Latency----------------
Mean ITL (ms): 115.50
Median ITL (ms): 48.42
P99 ITL (ms): 337.68
==================================================
Turn off DeepGEMM
Serving script
export SGLANG_ENABLE_JIT_DEEPGEMM=0
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8
The server fails to start when DeepGEMM is disabled.
Enable Tool Call Configuration
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8 \
--enable-dp-attention \
--dp 8 \
--reasoning-parser deepseek-v3 \
--tool-call-parser deepseekv32
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 49.70
Total input tokens: 219111
Total generated tokens: 197192
Request throughput (req/s): 20.12
Output token throughput (tok/s): 3967.70
Peak output token throughput (tok/s): 13083.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 8376.43
---------------Time to First Token----------------
Mean TTFT (ms): 6577.70
Median TTFT (ms): 6815.18
P99 TTFT (ms): 11936.83
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 250.64
Median TPOT (ms): 95.20
P99 TPOT (ms): 1825.45
---------------Inter-token Latency----------------
Mean ITL (ms): 81.42
Median ITL (ms): 53.51
P99 ITL (ms): 269.50
==================================================
Context Length Adjustment
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8 \
--enable-dp-attention \
--dp 8 \
--reasoning-parser deepseek-v3 \
--tool-call-parser deepseekv32 \
--context-length=32768
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 47.60
Total input tokens: 219111
Total generated tokens: 197380
Request throughput (req/s): 21.01
Output token throughput (tok/s): 4146.96
Peak output token throughput (tok/s): 12798.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 8750.49
---------------Time to First Token----------------
Mean TTFT (ms): 5615.34
Median TTFT (ms): 5448.81
P99 TTFT (ms): 10053.28
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 214.05
Median TPOT (ms): 87.57
P99 TPOT (ms): 1553.18
---------------Inter-token Latency----------------
Mean ITL (ms): 76.04
Median ITL (ms): 54.16
P99 ITL (ms): 210.36
==================================================
KV Cache DType
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8 \
--enable-dp-attention \
--dp 8 \
--reasoning-parser deepseek-v3 \
--tool-call-parser deepseekv32 \
--context-length=32768 \
--kv-cache-dtype fp8_e4m3
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 49.00
Total input tokens: 219111
Total generated tokens: 197078
Request throughput (req/s): 20.41
Output token throughput (tok/s): 4022.27
Peak output token throughput (tok/s): 12674.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 8494.23
---------------Time to First Token----------------
Mean TTFT (ms): 5472.04
Median TTFT (ms): 5289.41
P99 TTFT (ms): 9731.02
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 213.50
Median TPOT (ms): 92.04
P99 TPOT (ms): 1471.85
---------------Inter-token Latency----------------
Mean ITL (ms): 79.15
Median ITL (ms): 58.50
P99 ITL (ms): 100.53
==================================================
Attention Backend: fa3 + fa3
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8 \
--enable-dp-attention \
--dp 8 \
--reasoning-parser deepseek-v3 \
--tool-call-parser deepseekv32 \
--context-length=32768 \
--attention-backend nsa \
--nsa-prefill-backend fa3 \
--nsa-decode-backend fa3
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 46.37
Total input tokens: 219111
Total generated tokens: 196786
Request throughput (req/s): 21.56
Output token throughput (tok/s): 4243.46
Peak output token throughput (tok/s): 13860.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 8968.32
---------------Time to First Token----------------
Mean TTFT (ms): 5303.50
Median TTFT (ms): 5145.47
P99 TTFT (ms): 9366.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 203.05
Median TPOT (ms): 85.23
P99 TPOT (ms): 1399.48
---------------Inter-token Latency----------------
Mean ITL (ms): 74.03
Median ITL (ms): 53.86
P99 ITL (ms): 177.03
==================================================
Attention Backend: flashmla_sparse + flashmla_kv
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8 \
--enable-dp-attention \
--dp 8 \
--reasoning-parser deepseek-v3 \
--tool-call-parser deepseekv32 \
--context-length=32768 \
--attention-backend nsa \
--nsa-prefill-backend flashmla_sparse \
--nsa-decode-backend flashmla_kv
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 77.68
Total input tokens: 219111
Total generated tokens: 197443
Request throughput (req/s): 12.87
Output token throughput (tok/s): 2541.62
Peak output token throughput (tok/s): 8352.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 5362.16
---------------Time to First Token----------------
Mean TTFT (ms): 4811.25
Median TTFT (ms): 4727.17
P99 TTFT (ms): 9086.60
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 260.25
Median TPOT (ms): 128.33
P99 TPOT (ms): 1625.53
---------------Inter-token Latency----------------
Mean ITL (ms): 114.47
Median ITL (ms): 91.44
P99 ITL (ms): 256.14
==================================================
4. Optimizing TensorRT-LLM
Parallelism: TP+EP
Serving script
trtllm-serve /workspace/DeepSeek-v3.2 --tp_size 8 --ep_size 8 --pp_size 1
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 984
Benchmark duration (s): 142.99
Total input tokens: 216405
Total generated tokens: 195740
Request throughput (req/s): 6.88
Output token throughput (tok/s): 1368.90
Peak output token throughput (tok/s): 4635.00
Peak concurrent requests: 984.00
Total Token throughput (tok/s): 2882.32
---------------Time to First Token----------------
Mean TTFT (ms): 24155.45
Median TTFT (ms): 23945.42
P99 TTFT (ms): 45683.09
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 561.24
Median TPOT (ms): 312.75
P99 TPOT (ms): 1473.03
---------------Inter-token Latency----------------
Mean ITL (ms): 248.40
Median ITL (ms): 142.18
P99 ITL (ms): 1819.83
==================================================
Turn off DeepGEMM
Serving script
export TRTLLM_DG_ENABLED=0
trtllm-serve /workspace/DeepSeek-v3.2 --tp_size 8 --ep_size 8 --pp_size 1
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 984
Benchmark duration (s): 134.65
Total input tokens: 216484
Total generated tokens: 194509
Request throughput (req/s): 7.31
Output token throughput (tok/s): 1444.56
Peak output token throughput (tok/s): 4661.00
Peak concurrent requests: 984.00
Total Token throughput (tok/s): 3052.33
---------------Time to First Token----------------
Mean TTFT (ms): 23229.65
Median TTFT (ms): 23100.28
P99 TTFT (ms): 44648.94
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 541.07
Median TPOT (ms): 264.04
P99 TPOT (ms): 1588.04
---------------Inter-token Latency----------------
Mean ITL (ms): 226.30
Median ITL (ms): 137.80
P99 ITL (ms): 1624.47
==================================================
Summary of Optimization Options
| Optimization Option | Throughput Improvement |
|---|---|
| Parallelism | +28.66% |
| Tool Call Configuration | +13.94% |
| Context Length Adjustment | +4.47% |
| Attention Backend | +2.49% |
| KV Cache Dtype | - |
| MTP | - |
| DeepGEMM | - |
| Total (vs. Baseline) | +56.97% |
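Note that the per-option gains compound multiplicatively rather than additively, which is how +28.66%, +13.94%, +4.47%, and +2.49% combine to roughly +57% overall. A quick check in Python:

```python
# Each optimization multiplies the previous throughput, so the total
# improvement is the product of the individual factors.
gains = [0.2866, 0.1394, 0.0447, 0.0249]
total = 1.0
for g in gains:
    total *= 1.0 + g
print(f"combined: +{(total - 1) * 100:.2f}%")               # ~ +56.96%

# Equivalently, directly from the measured throughputs:
print(f"measured: +{(8968.32 / 5713.95 - 1) * 100:.2f}%")   # +56.95%
```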
Other Benchmark Cases
We further benchmarked the optimized configuration to evaluate its generalization under various workloads.
Baseline serving script
vllm serve deepseek-ai/DeepSeek-V3.2 -tp 8 --tokenizer-mode deepseek_v32 --reasoning-parser deepseek_v3
Baseline benchmark results
# random 128 input
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 13.11
Total input tokens: 128000
Total generated tokens: 4000
Request throughput (req/s): 76.30
Output token throughput (tok/s): 305.20
Peak output token throughput (tok/s): 1029.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 10071.49
---------------Time to First Token----------------
Mean TTFT (ms): 5786.15
Median TTFT (ms): 4666.02
P99 TTFT (ms): 12922.46
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 781.80
Median TPOT (ms): 687.63
P99 TPOT (ms): 1637.52
---------------Inter-token Latency----------------
Mean ITL (ms): 586.79
Median ITL (ms): 664.73
P99 ITL (ms): 3545.51
==================================================
# random 2K input
============ Serving Benchmark Result ============
Successful requests: 500
Benchmark duration (s): 96.10
Total input tokens: 1000000
Total generated tokens: 50000
Request throughput (req/s): 5.20
Output token throughput (tok/s): 520.27
Peak output token throughput (tok/s): 4085.00
Peak concurrent requests: 500.00
Total Token throughput (tok/s): 10925.59
---------------Time to First Token----------------
Mean TTFT (ms): 45667.57
Median TTFT (ms): 43997.53
P99 TTFT (ms): 87384.57
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 354.59
Median TPOT (ms): 433.59
P99 TPOT (ms): 452.30
---------------Inter-token Latency----------------
Mean ITL (ms): 353.26
Median ITL (ms): 68.59
P99 ITL (ms): 661.87
==================================================
# random 4K input
============ Serving Benchmark Result ============
Successful requests: 500
Benchmark duration (s): 210.54
Total input tokens: 2000000
Total generated tokens: 100000
Request throughput (req/s): 2.37
Output token throughput (tok/s): 474.96
Peak output token throughput (tok/s): 2769.00
Peak concurrent requests: 500.00
Total Token throughput (tok/s): 9974.26
---------------Time to First Token----------------
Mean TTFT (ms): 103386.88
Median TTFT (ms): 97324.68
P99 TTFT (ms): 200514.23
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 226.74
Median TPOT (ms): 246.98
P99 TPOT (ms): 275.68
---------------Inter-token Latency----------------
Mean ITL (ms): 228.79
Median ITL (ms): 49.50
P99 ITL (ms): 683.63
==================================================
# random 32K input
============ Serving Benchmark Result ============
Successful requests: 100
Benchmark duration (s): 330.61
Total input tokens: 3200000
Total generated tokens: 10000
Request throughput (req/s): 0.30
Output token throughput (tok/s): 30.25
Peak output token throughput (tok/s): 384.00
Peak concurrent requests: 100.00
Total Token throughput (tok/s): 9709.27
---------------Time to First Token----------------
Mean TTFT (ms): 164003.81
Median TTFT (ms): 164174.93
P99 TTFT (ms): 325134.01
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 472.52
Median TPOT (ms): 512.36
P99 TPOT (ms): 514.39
---------------Inter-token Latency----------------
Mean ITL (ms): 477.69
Median ITL (ms): 741.90
P99 ITL (ms): 949.57
==================================================
# random 1K input + 2K generation
============ Serving Benchmark Result ============
Successful requests: 100
Benchmark duration (s): 96.39
Total input tokens: 100000
Total generated tokens: 200000
Request throughput (req/s): 1.04
Output token throughput (tok/s): 2075.01
Peak output token throughput (tok/s): 2598.00
Peak concurrent requests: 100.00
Total Token throughput (tok/s): 3112.52
---------------Time to First Token----------------
Mean TTFT (ms): 4789.20
Median TTFT (ms): 4590.83
P99 TTFT (ms): 10608.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 45.72
Median TPOT (ms): 45.83
P99 TPOT (ms): 47.44
---------------Inter-token Latency----------------
Mean ITL (ms): 46.34
Median ITL (ms): 43.13
P99 ITL (ms): 44.29
==================================================
Optimized serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8 --dp-size 8 --enable-dp-attention \
--reasoning-parser deepseek-v3 \
--tool-call-parser deepseekv32 \
--context-length=32768 \
--attention-backend nsa \
--nsa-prefill-backend fa3 \
--nsa-decode-backend fa3
Optimized benchmark results
# random 128 input
============ Serving Benchmark Result ============
Successful requests: 984
Benchmark duration (s): 7.13
Total input tokens: 125952
Total generated tokens: 3936
Request throughput (req/s): 138.09
Output token throughput (tok/s): 552.34
Peak output token throughput (tok/s): 2526.00
Peak concurrent requests: 984.00
Total Token throughput (tok/s): 18227.38
---------------Time to First Token----------------
Mean TTFT (ms): 4441.81
Median TTFT (ms): 4685.89
P99 TTFT (ms): 6735.22
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 776.55
Median TPOT (ms): 714.97
P99 TPOT (ms): 1659.07
---------------Inter-token Latency----------------
Mean ITL (ms): 465.93
Median ITL (ms): 59.97
P99 ITL (ms): 4564.16
==================================================
# random 2K input
============ Serving Benchmark Result ============
Successful requests: 500
Benchmark duration (s): 37.89
Total input tokens: 1000000
Total generated tokens: 50000
Request throughput (req/s): 13.20
Output token throughput (tok/s): 1319.64
Peak output token throughput (tok/s): 9477.00
Peak concurrent requests: 500.00
Total Token throughput (tok/s): 27712.54
---------------Time to First Token----------------
Mean TTFT (ms): 18696.78
Median TTFT (ms): 18939.89
P99 TTFT (ms): 32138.14
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 192.24
Median TPOT (ms): 189.20
P99 TPOT (ms): 349.96
---------------Inter-token Latency----------------
Mean ITL (ms): 189.18
Median ITL (ms): 55.05
P99 ITL (ms): 219.74
==================================================
# random 4K input
============ Serving Benchmark Result ============
Successful requests: 500
Benchmark duration (s): 102.21
Total input tokens: 2000000
Total generated tokens: 100000
Request throughput (req/s): 4.89
Output token throughput (tok/s): 978.37
Peak output token throughput (tok/s): 7180.00
Peak concurrent requests: 500.00
Total Token throughput (tok/s): 20545.67
---------------Time to First Token----------------
Mean TTFT (ms): 44019.81
Median TTFT (ms): 40806.24
P99 TTFT (ms): 90171.38
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 177.95
Median TPOT (ms): 161.14
P99 TPOT (ms): 321.10
---------------Inter-token Latency----------------
Mean ITL (ms): 177.57
Median ITL (ms): 49.55
P99 ITL (ms): 224.52
==================================================
# random 32K input
============ Serving Benchmark Result ============
Successful requests: 100
Benchmark duration (s): 160.14
Total input tokens: 3200000
Total generated tokens: 10000
Request throughput (req/s): 0.62
Output token throughput (tok/s): 62.45
Peak output token throughput (tok/s): 1189.00
Peak concurrent requests: 100.00
Total Token throughput (tok/s): 20045.18
---------------Time to First Token----------------
Mean TTFT (ms): 83964.72
Median TTFT (ms): 88265.70
P99 TTFT (ms): 157005.30
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 246.26
Median TPOT (ms): 229.67
P99 TPOT (ms): 514.42
---------------Inter-token Latency----------------
Mean ITL (ms): 244.63
Median ITL (ms): 33.21
P99 ITL (ms): 898.58
==================================================
# random 1K input + 2K generation
============ Serving Benchmark Result ============
Successful requests: 100
Benchmark duration (s): 80.99
Total input tokens: 100000
Total generated tokens: 200000
Request throughput (req/s): 1.23
Output token throughput (tok/s): 2469.32
Peak output token throughput (tok/s): 2800.00
Peak concurrent requests: 100.00
Total Token throughput (tok/s): 3703.98
---------------Time to First Token----------------
Mean TTFT (ms): 2103.54
Median TTFT (ms): 2216.52
P99 TTFT (ms): 3514.77
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 39.45
Median TPOT (ms): 39.40
P99 TPOT (ms): 40.17
---------------Inter-token Latency----------------
Mean ITL (ms): 40.56
Median ITL (ms): 38.90
P99 ITL (ms): 43.63
==================================================