Optimizing DeepSeek-V3.2 Throughput on NVIDIA H200 GPUs
Conclusion
Recommended configuration for optimizing DeepSeek-V3.2 throughput on a single node with 8×H200:
Serving Command
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8 --dp-size 8 --enable-dp-attention \
--reasoning-parser deepseek-v3 \
--tool-call-parser deepseekv32
Link for tool_chat_template_deepseekv32.jinja.
Based on the benchmarks below, we recommend the above configuration for optimizing DeepSeek-V3.2 throughput on 8×H200.
Parallelism and Tool Call Configuration provide the largest performance gains and are therefore included in the recommended command. Context Length Adjustment can further improve throughput but is highly workload-dependent and should be tuned according to actual usage. While Attention Backend optimizations show positive effects, their gains are relatively small and may vary across datasets, so they are not included in the default recommendation.
Comparison of benchmark results before and after optimization:
| Benchmark Case | Baseline Total TPS (vLLM without any optimizations) | Baseline Mean TPOT (ms) | Optimized Total TPS | Optimized Mean TPOT (ms) |
|---|---|---|---|---|
| ShareGPT | 5713.95 | 275.03 | 8968.32 (+56.95%) | 203.05 |
| Short Prompt | 10071.49 | 781.80 | 18227.38 (+80.98%) | 776.55 |
| Medium Prompt | 10925.59 | 354.59 | 27712.54 (+153.65%) | 192.24 |
| Long Prompt | 9974.26 | 226.74 | 20545.67 (+105.99%) | 177.95 |
| Very Long Prompt | 9709.27 | 472.52 | 20045.18 (+106.45%) | 246.26 |
| Generation-Heavy Prompt | 3112.52 | 45.72 | 3703.98 (+19.0%) | 39.45 |
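The percentages in the comparison can be reproduced from the total throughput figures alone. The following is a minimal sketch, assuming each percentage is the relative gain of the optimized Total TPS over the baseline Total TPS; the dictionary and function names are illustrative and not part of any benchmark tool.

```python
# Total token throughput (tok/s) per benchmark case as (baseline, optimized),
# copied from the comparison of results before and after optimization.
RESULTS = {
    "ShareGPT": (5713.95, 8968.32),
    "Short Prompt": (10071.49, 18227.38),
    "Medium Prompt": (10925.59, 27712.54),
    "Long Prompt": (9974.26, 20545.67),
    "Very Long Prompt": (9709.27, 20045.18),
    "Generation-Heavy Prompt": (3112.52, 3703.98),
}


def improvement_pct(baseline: float, optimized: float) -> float:
    """Relative throughput gain of `optimized` over `baseline`, in percent."""
    return (optimized / baseline - 1.0) * 100.0


for case, (base, opt) in RESULTS.items():
    print(f"{case:<24} {improvement_pct(base, opt):+.2f}%")
```

The same helper can be pointed at any pair of runs from the sections below to compare two configurations directly.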
Note
Our benchmark tests do not cover all possible optimization combinations. For example, we select the inference engine that performs best under its default configuration as the starting point for further tuning. This pruning approach yields a local optimum, which may not be the global optimum. Other optimization methods depend on specific user scenarios, including max batch size, scheduling configuration, extended KV cache, and CUDA graphs. The conclusions in this document can serve as a starting point for more targeted optimizations. The tests were conducted on specific hardware and software setups, and advances in the inference engines may lead to new conclusions. If anything is missing or out of date, please let us know.
Optimization Objective
Achieve high throughput under high-concurrency request scenarios.
Experimental Setup
Model
deepseek-ai/DeepSeek-V3.2
Hardware
8 × NVIDIA H200 SXM GPUs on a single node.
Engine Version
- vLLM: v0.13.0
- SGLang: v0.5.6.post2
- TensorRT-LLM: 1.2.0rc5
Benchmark Dataset
- ShareGPT
- Random dataset with varying sequence lengths:
  - Very long prompt: 32000 input tokens, 100 output tokens
  - Long prompt: 4000 input tokens, 200 output tokens
  - Medium prompt: 2000 input tokens, 100 output tokens
  - Short prompt: 128 input tokens, 4 output tokens
  - Generation-heavy prompt: 1K input tokens, 2K output tokens
Benchmark Script
We use the vLLM bench CLI tool to benchmark model performance. The following commands run the benchmark:
# Prepare the ShareGPT dataset
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
# Benchmark on ShareGPT dataset
vllm bench serve --model deepseek-ai/DeepSeek-V3.2 --backend openai-chat --endpoint /v1/chat/completions --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000
# Benchmark on random dataset (fixed seed for reproducibility)
vllm bench serve --model deepseek-ai/DeepSeek-V3.2 --backend openai-chat --endpoint /v1/chat/completions --dataset-name random --random-input-len 4000 --random-output-len 200 --num-prompts 500 --seed 42
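The random-dataset command above shows only the long-prompt case. As a sketch of how the other sequence lengths could be driven with the same flags, the script below prints one command per workload. The prompt counts are an assumption read off the "Successful requests" lines of the baseline runs reported later, and the workload labels and helper name are illustrative only.

```python
import shlex

# (label, input tokens, output tokens, number of prompts).
# Token lengths follow the dataset list; prompt counts are assumed from the
# request counts in the reported baseline results.
WORKLOADS = [
    ("short", 128, 4, 1000),
    ("medium", 2000, 100, 500),
    ("long", 4000, 200, 500),
    ("very-long", 32000, 100, 100),
    ("generation-heavy", 1000, 2000, 100),
]


def bench_command(input_len, output_len, num_prompts, seed=42):
    """Build the argument list for one random-dataset benchmark run."""
    return [
        "vllm", "bench", "serve",
        "--model", "deepseek-ai/DeepSeek-V3.2",
        "--backend", "openai-chat",
        "--endpoint", "/v1/chat/completions",
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--num-prompts", str(num_prompts),
        "--seed", str(seed),
    ]


for label, input_len, output_len, num_prompts in WORKLOADS:
    print(f"# {label}")
    print(shlex.join(bench_command(input_len, output_len, num_prompts)))
```

The script only prints the commands; passing each list to `subprocess.run` would execute them against a running server.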
Experiment Results
1. Baseline of the Inference Engine
vLLM
Serving script
vllm serve deepseek-ai/DeepSeek-V3.2 -tp 8 --tokenizer-mode deepseek_v32 --reasoning-parser deepseek_v3
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 72.75
Total input tokens: 219111
Total generated tokens: 196599
Request throughput (req/s): 13.75
Output token throughput (tok/s): 2702.26
Peak output token throughput (tok/s): 6009.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 5713.95
---------------Time to First Token----------------
Mean TTFT (ms): 13054.63
Median TTFT (ms): 12849.39
P99 TTFT (ms): 22754.42
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 275.03
Median TPOT (ms): 171.63
P99 TPOT (ms): 666.78
---------------Inter-token Latency----------------
Mean ITL (ms): 131.14
Median ITL (ms): 81.97
P99 ITL (ms): 668.29
==================================================
SGLang
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 138.25
Total input tokens: 219111
Total generated tokens: 197337
Request throughput (req/s): 7.23
Output token throughput (tok/s): 1427.44
Peak output token throughput (tok/s): 7949.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 3012.37
---------------Time to First Token----------------
Mean TTFT (ms): 11470.15
Median TTFT (ms): 11240.35
P99 TTFT (ms): 22363.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 1103.96
Median TPOT (ms): 672.86
P99 TPOT (ms): 4231.55
---------------Inter-token Latency----------------
Mean ITL (ms): 389.06
Median ITL (ms): 64.46
P99 ITL (ms): 3013.62
==================================================
TensorRT-LLM
Serving script
trtllm-serve /workspace/DeepSeek-v3.2 --tp_size 8
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 984
Benchmark duration (s): 236.70
Total input tokens: 215176
Total generated tokens: 194893
Request throughput (req/s): 4.16
Output token throughput (tok/s): 823.39
Peak output token throughput (tok/s): 4212.00
Peak concurrent requests: 984.00
Total Token throughput (tok/s): 1732.48
---------------Time to First Token----------------
Mean TTFT (ms): 65710.22
Median TTFT (ms): 66076.17
P99 TTFT (ms): 125723.14
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 1391.76
Median TPOT (ms): 608.74
P99 TPOT (ms): 4785.14
---------------Inter-token Latency----------------
Mean ITL (ms): 459.69
Median ITL (ms): 141.56
P99 ITL (ms): 4910.46
==================================================
Result: vLLM (5713.95 tok/s) > SGLang (3012.37 tok/s) > TensorRT-LLM (1732.48 tok/s)
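When comparing many runs, it helps to read the metrics out of the report text instead of copying them by hand. The sketch below parses an excerpt of the vLLM baseline report and ranks the three engines by total token throughput. It assumes that total token throughput is (input tokens + generated tokens) / duration; the reported figures agree with that to within the rounding of the duration, but the definition is not stated in this document. The function and variable names are illustrative.

```python
import re

# Excerpt of the vLLM baseline report shown earlier in this section.
VLLM_BASELINE = """\
Successful requests: 1000
Benchmark duration (s): 72.75
Total input tokens: 219111
Total generated tokens: 196599
Total Token throughput (tok/s): 5713.95
---------------Time to First Token----------------
Mean TTFT (ms): 13054.63
"""


def parse_metrics(report):
    """Map each 'name: value' line of a benchmark report to a float."""
    metrics = {}
    for line in report.splitlines():
        match = re.fullmatch(r"(.+?):\s+([0-9.]+)", line.strip())
        if match:  # separator lines have no 'name: value' shape and are skipped
            metrics[match.group(1)] = float(match.group(2))
    return metrics


metrics = parse_metrics(VLLM_BASELINE)

# Assumed definition: all processed tokens divided by wall-clock duration.
recomputed = (
    metrics["Total input tokens"] + metrics["Total generated tokens"]
) / metrics["Benchmark duration (s)"]
print(f"reported {metrics['Total Token throughput (tok/s)']:.2f}, recomputed {recomputed:.2f}")

# Total token throughput (tok/s) of the three default-configuration runs.
TOTAL_TPS = {"vLLM": 5713.95, "SGLang": 3012.37, "TensorRT-LLM": 1732.48}
ranking = sorted(TOTAL_TPS, key=TOTAL_TPS.get, reverse=True)
print(" > ".join(ranking))
```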
2. Optimizing vLLM
Parallelism: DP+EP
Serving script
# 81920 is half of the full context length; the full context length causes OOM
vllm serve deepseek-ai/DeepSeek-V3.2 --tokenizer-mode deepseek_v32 --reasoning-parser deepseek_v3 \
-tp 1 -dp 8 --enable-expert-parallel --max-model-len 81920
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 65.62
Total input tokens: 219111
Total generated tokens: 197109
Request throughput (req/s): 15.24
Output token throughput (tok/s): 3003.90
Peak output token throughput (tok/s): 10222.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 6343.10
---------------Time to First Token----------------
Mean TTFT (ms): 9144.24
Median TTFT (ms): 10233.68
P99 TTFT (ms): 14920.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 278.22
Median TPOT (ms): 115.72
P99 TPOT (ms): 2048.82
---------------Inter-token Latency----------------
Mean ITL (ms): 100.47
Median ITL (ms): 74.61
P99 ITL (ms): 1239.62
==================================================
Parallelism: DCP
Serving script
vllm serve deepseek-ai/DeepSeek-V3.2 -tp 8 --tokenizer-mode deepseek_v32 --reasoning-parser deepseek_v3 \
-dcp 8
DeepSeek V3.2 relies on the FlashMLA sparse attention backend, which currently does not expose softmax log-sum-exp (LSE) during the decode phase. Since Decode Context Parallelism (DCP) requires softmax LSE for correct cross-rank aggregation, DCP is not supported with FlashMLA at this time, leading to a runtime failure in vLLM. This limitation has been discussed in the vLLM repository (see issue #27544).
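To see why the LSE matters for cross-rank aggregation, consider attention for a single query whose keys are split across ranks. Each rank can only normalize over its own keys, so the partial outputs must be reweighted by each rank's share of the global softmax denominator, and that share is recovered from the per-rank LSE. The following toy sketch uses scalar values and plain Python; it illustrates the general math only and is not vLLM's or FlashMLA's implementation.

```python
import math


def partial_attention(scores, values):
    """Softmax-weighted average over one shard, plus the shard's log-sum-exp."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = sum(w * v for w, v in zip(weights, values)) / total
    lse = m + math.log(total)
    return out, lse


def merge_partials(partials):
    """Combine per-shard (output, lse) pairs into the full-context output."""
    lses = [lse for _, lse in partials]
    m = max(lses)
    global_lse = m + math.log(sum(math.exp(l - m) for l in lses))
    # exp(lse_i - global_lse) is shard i's share of the global softmax mass.
    return sum(math.exp(lse - global_lse) * out for out, lse in partials)


scores = [0.5, 2.0, -1.0, 3.0]   # query-key logits for four cached tokens
values = [1.0, 2.0, 3.0, 4.0]    # scalar stand-ins for value vectors

full, _ = partial_attention(scores, values)
merged = merge_partials([
    partial_attention(scores[:2], values[:2]),   # "rank 0" holds tokens 0-1
    partial_attention(scores[2:], values[2:]),   # "rank 1" holds tokens 2-3
])
print(full, merged)
```

Without the per-shard LSE, the merge step has no way to know how much softmax mass each shard carries, which is why a backend that does not return it cannot be combined across ranks this way.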
MTP
Serving script
vllm serve deepseek-ai/DeepSeek-V3.2 -tp 8 --tokenizer-mode deepseek_v32 --reasoning-parser deepseek_v3 \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 75.13
Total input tokens: 219111
Total generated tokens: 197345
Request throughput (req/s): 13.31
Output token throughput (tok/s): 2626.63
Peak output token throughput (tok/s): 3940.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 5542.96
---------------Time to First Token----------------
Mean TTFT (ms): 11110.51
Median TTFT (ms): 10739.31
P99 TTFT (ms): 21886.92
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 291.48
Median TPOT (ms): 250.14
P99 TPOT (ms): 735.48
---------------Inter-token Latency----------------
Mean ITL (ms): 290.48
Median ITL (ms): 144.81
P99 ITL (ms): 4636.00
==================================================
Turn off DeepGEMM in vLLM
Serving script
export VLLM_USE_DEEP_GEMM=0
vllm serve deepseek-ai/DeepSeek-V3.2 -tp 8 --tokenizer-mode deepseek_v32 --reasoning-parser deepseek_v3
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 74.92
Total input tokens: 219111
Total generated tokens: 197222
Request throughput (req/s): 13.35
Output token throughput (tok/s): 2632.38
Peak output token throughput (tok/s): 6010.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 5556.92
---------------Time to First Token----------------
Mean TTFT (ms): 11250.21
Median TTFT (ms): 10762.25
P99 TTFT (ms): 20723.94
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 287.49
Median TPOT (ms): 188.99
P99 TPOT (ms): 667.67
---------------Inter-token Latency----------------
Mean ITL (ms): 140.98
Median ITL (ms): 83.88
P99 ITL (ms): 671.43
==================================================
Tool Call Configuration
Serving script
vllm serve deepseek-ai/DeepSeek-V3.2 -tp 8 --tokenizer-mode deepseek_v32 --reasoning-parser deepseek_v3 --tool-call-parser deepseek_v32 --enable-auto-tool-choice
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 74.16
Total input tokens: 219111
Total generated tokens: 197012
Request throughput (req/s): 13.48
Output token throughput (tok/s): 2656.40
Peak output token throughput (tok/s): 6095.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 5610.78
---------------Time to First Token----------------
Mean TTFT (ms): 13648.83
Median TTFT (ms): 13557.38
P99 TTFT (ms): 23498.54
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 280.78
Median TPOT (ms): 176.29
P99 TPOT (ms): 677.36
---------------Inter-token Latency----------------
Mean ITL (ms): 132.99
Median ITL (ms): 82.69
P99 ITL (ms): 683.11
==================================================
Context Length Adjustment
Serving script
vllm serve deepseek-ai/DeepSeek-V3.2 -tp 8 --tokenizer-mode deepseek_v32 --reasoning-parser deepseek_v3 --max-model-len 32768
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 75.12
Total input tokens: 219111
Total generated tokens: 196938
Request throughput (req/s): 13.31
Output token throughput (tok/s): 2621.48
Peak output token throughput (tok/s): 6293.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 5538.11
---------------Time to First Token----------------
Mean TTFT (ms): 10439.14
Median TTFT (ms): 10296.60
P99 TTFT (ms): 23842.17
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 357.68
Median TPOT (ms): 223.35
P99 TPOT (ms): 1169.74
---------------Inter-token Latency----------------
Mean ITL (ms): 154.24
Median ITL (ms): 78.13
P99 ITL (ms): 660.95
==================================================
Attention Backend: CUTLASS_MLA
Serving script
export VLLM_ATTENTION_BACKEND="CUTLASS_MLA"
vllm serve deepseek-ai/DeepSeek-V3.2 -tp 8 --tokenizer-mode deepseek_v32 --reasoning-parser deepseek_v3
ValueError: Selected backend AttentionBackendEnum.CUTLASS_MLA is not valid for this configuration. Reason: ['sparse not supported', 'compute capability not supported']
Attention Backend: FLASHMLA
Serving script
export VLLM_ATTENTION_BACKEND="FLASHMLA"
ValueError: Selected backend AttentionBackendEnum.FLASHMLA is not valid for this configuration. Reason: ['sparse not supported']
Attention Backend: TRITON_MLA
Serving script
export VLLM_ATTENTION_BACKEND="TRITON_MLA"
vllm serve deepseek-ai/DeepSeek-V3.2 -tp 8 --tokenizer-mode deepseek_v32 --reasoning-parser deepseek_v3
ValueError: Selected backend AttentionBackendEnum.TRITON_MLA is not valid for this configuration. Reason: ['sparse not supported']
3. Optimizing SGLang
Parallelism: TP+DP Attention
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8 --enable-dp-attention
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 99.46
Total input tokens: 219111
Total generated tokens: 197633
Request throughput (req/s): 10.05
Output token throughput (tok/s): 1987.11
Peak output token throughput (tok/s): 8911.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 4190.17
---------------Time to First Token----------------
Mean TTFT (ms): 13679.11
Median TTFT (ms): 12663.74
P99 TTFT (ms): 21665.97
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 608.90
Median TPOT (ms): 398.19
P99 TPOT (ms): 3453.39
---------------Inter-token Latency----------------
Mean ITL (ms): 230.03
Median ITL (ms): 59.95
P99 ITL (ms): 1824.84
==================================================
Parallelism: TP+DP+DP Attention
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8 --dp-size 8 --enable-dp-attention
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 56.62
Total input tokens: 219111
Total generated tokens: 197116
Request throughput (req/s): 17.66
Output token throughput (tok/s): 3481.55
Peak output token throughput (tok/s): 11298.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 7351.59
---------------Time to First Token----------------
Mean TTFT (ms): 7815.00
Median TTFT (ms): 8024.81
P99 TTFT (ms): 12928.46
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 287.86
Median TPOT (ms): 107.58
P99 TPOT (ms): 2096.18
---------------Inter-token Latency----------------
Mean ITL (ms): 91.68
Median ITL (ms): 58.87
P99 ITL (ms): 317.43
==================================================
Parallelism: TP+DP+DP Attention+EP
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8 --dp-size 8 --enable-dp-attention --ep-size 8
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 57.14
Total input tokens: 219111
Total generated tokens: 197614
Request throughput (req/s): 17.50
Output token throughput (tok/s): 3458.68
Peak output token throughput (tok/s): 11757.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 7293.61
---------------Time to First Token----------------
Mean TTFT (ms): 8437.36
Median TTFT (ms): 8346.50
P99 TTFT (ms): 15410.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 319.46
Median TPOT (ms): 111.65
P99 TPOT (ms): 2443.91
---------------Inter-token Latency----------------
Mean ITL (ms): 94.48
Median ITL (ms): 56.54
P99 ITL (ms): 451.92
==================================================
MTP
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--tp-size 8 \
--speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 278.36
Total input tokens: 219111
Total generated tokens: 193349
Request throughput (req/s): 3.59
Output token throughput (tok/s): 694.61
Peak output token throughput (tok/s): 974.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 1481.77
---------------Time to First Token----------------
Mean TTFT (ms): 139046.94
Median TTFT (ms): 144733.94
P99 TTFT (ms): 260932.68
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 68.88
Median TPOT (ms): 63.31
P99 TPOT (ms): 280.02
---------------Inter-token Latency----------------
Mean ITL (ms): 115.50
Median ITL (ms): 48.42
P99 ITL (ms): 337.68
==================================================
Turn off DeepGEMM
Serving script
export SGLANG_ENABLE_JIT_DEEPGEMM=0
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8
The server fails to start when DeepGEMM is disabled.
Tool Call Configuration
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8 \
--enable-dp-attention \
--dp 8 \
--reasoning-parser deepseek-v3 \
--tool-call-parser deepseekv32
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 49.70
Total input tokens: 219111
Total generated tokens: 197192
Request throughput (req/s): 20.12
Output token throughput (tok/s): 3967.70
Peak output token throughput (tok/s): 13083.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 8376.43
---------------Time to First Token----------------
Mean TTFT (ms): 6577.70
Median TTFT (ms): 6815.18
P99 TTFT (ms): 11936.83
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 250.64
Median TPOT (ms): 95.20
P99 TPOT (ms): 1825.45
---------------Inter-token Latency----------------
Mean ITL (ms): 81.42
Median ITL (ms): 53.51
P99 ITL (ms): 269.50
==================================================
Context Length Adjustment
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8 \
--enable-dp-attention \
--dp 8 \
--reasoning-parser deepseek-v3 \
--tool-call-parser deepseekv32 \
--context-length=32768
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 47.60
Total input tokens: 219111
Total generated tokens: 197380
Request throughput (req/s): 21.01
Output token throughput (tok/s): 4146.96
Peak output token throughput (tok/s): 12798.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 8750.49
---------------Time to First Token----------------
Mean TTFT (ms): 5615.34
Median TTFT (ms): 5448.81
P99 TTFT (ms): 10053.28
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 214.05
Median TPOT (ms): 87.57
P99 TPOT (ms): 1553.18
---------------Inter-token Latency----------------
Mean ITL (ms): 76.04
Median ITL (ms): 54.16
P99 ITL (ms): 210.36
==================================================
KV Cache DType
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8 \
--enable-dp-attention \
--dp 8 \
--reasoning-parser deepseek-v3 \
--tool-call-parser deepseekv32 \
--context-length=32768 \
--kv-cache-dtype fp8_e4m3
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 49.00
Total input tokens: 219111
Total generated tokens: 197078
Request throughput (req/s): 20.41
Output token throughput (tok/s): 4022.27
Peak output token throughput (tok/s): 12674.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 8494.23
---------------Time to First Token----------------
Mean TTFT (ms): 5472.04
Median TTFT (ms): 5289.41
P99 TTFT (ms): 9731.02
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 213.50
Median TPOT (ms): 92.04
P99 TPOT (ms): 1471.85
---------------Inter-token Latency----------------
Mean ITL (ms): 79.15
Median ITL (ms): 58.50
P99 ITL (ms): 100.53
==================================================
Attention Backend: fa3 + fa3
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8 \
--enable-dp-attention \
--dp 8 \
--reasoning-parser deepseek-v3 \
--tool-call-parser deepseekv32 \
--context-length=32768 \
--attention-backend nsa \
--nsa-prefill-backend fa3 \
--nsa-decode-backend fa3
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 46.37
Total input tokens: 219111
Total generated tokens: 196786
Request throughput (req/s): 21.56
Output token throughput (tok/s): 4243.46
Peak output token throughput (tok/s): 13860.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 8968.32
---------------Time to First Token----------------
Mean TTFT (ms): 5303.50
Median TTFT (ms): 5145.47
P99 TTFT (ms): 9366.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 203.05
Median TPOT (ms): 85.23
P99 TPOT (ms): 1399.48
---------------Inter-token Latency----------------
Mean ITL (ms): 74.03
Median ITL (ms): 53.86
P99 ITL (ms): 177.03
==================================================
Attention Backend: flashmla_sparse + flashmla_kv
Serving script
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --host 0.0.0.0 --port 8000 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8 \
--enable-dp-attention \
--dp 8 \
--reasoning-parser deepseek-v3 \
--tool-call-parser deepseekv32 \
--context-length=32768 \
--attention-backend nsa \
--nsa-prefill-backend flashmla_sparse \
--nsa-decode-backend flashmla_kv
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 77.68
Total input tokens: 219111
Total generated tokens: 197443
Request throughput (req/s): 12.87
Output token throughput (tok/s): 2541.62
Peak output token throughput (tok/s): 8352.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 5362.16
---------------Time to First Token----------------
Mean TTFT (ms): 4811.25
Median TTFT (ms): 4727.17
P99 TTFT (ms): 9086.60
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 260.25
Median TPOT (ms): 128.33
P99 TPOT (ms): 1625.53
---------------Inter-token Latency----------------
Mean ITL (ms): 114.47
Median ITL (ms): 91.44
P99 ITL (ms): 256.14
==================================================
4. Optimizing TensorRT-LLM
Parallelism: TP+EP
Serving script
trtllm-serve /workspace/DeepSeek-v3.2 --tp_size 8 --ep_size 8 --pp_size 1
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 984
Benchmark duration (s): 142.99
Total input tokens: 216405
Total generated tokens: 195740
Request throughput (req/s): 6.88
Output token throughput (tok/s): 1368.90
Peak output token throughput (tok/s): 4635.00
Peak concurrent requests: 984.00
Total Token throughput (tok/s): 2882.32
---------------Time to First Token----------------
Mean TTFT (ms): 24155.45
Median TTFT (ms): 23945.42
P99 TTFT (ms): 45683.09
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 561.24
Median TPOT (ms): 312.75
P99 TPOT (ms): 1473.03
---------------Inter-token Latency----------------
Mean ITL (ms): 248.40
Median ITL (ms): 142.18
P99 ITL (ms): 1819.83
==================================================
Turn off DeepGEMM
Serving script
export TRTLLM_DG_ENABLED=0
trtllm-serve /workspace/DeepSeek-v3.2 --tp_size 8 --ep_size 8 --pp_size 1
Benchmark result
============ Serving Benchmark Result ============
Successful requests: 984
Benchmark duration (s): 134.65
Total input tokens: 216484
Total generated tokens: 194509
Request throughput (req/s): 7.31
Output token throughput (tok/s): 1444.56
Peak output token throughput (tok/s): 4661.00
Peak concurrent requests: 984.00
Total Token throughput (tok/s): 3052.33
---------------Time to First Token----------------
Mean TTFT (ms): 23229.65
Median TTFT (ms): 23100.28
P99 TTFT (ms): 44648.94
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 541.07
Median TPOT (ms): 264.04
P99 TPOT (ms): 1588.04
---------------Inter-token Latency----------------
Mean ITL (ms): 226.30
Median ITL (ms): 137.80
P99 ITL (ms): 1624.47
==================================================
Summary of Optimization Options
| Optimization Option | Throughput Improvement |
|---|---|
| Parallelism | +28.66% |
| Tool Call Configuration | +13.94% |
| Context Length Adjustment | +4.47% |
| Attention Backend | +2.49% |
| KV Cache DType | - |
| MTP | - |
| DeepGEMM | - |
| Total (vs. Baseline) | +56.95% |
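The per-option figures do not add up to the total; they compound. Each step appears to be measured relative to the configuration before it (Parallelism relative to the vLLM baseline of 5713.95 tok/s, Tool Call Configuration relative to the 7351.59 tok/s parallelism result, and so on), which is an inference from the reported throughput numbers rather than something stated explicitly. Under that assumption, a short check multiplies the steps together and lands at roughly +57%, matching the overall gain up to rounding.

```python
import math

# Step-by-step gains in percent, each assumed to be relative to the
# configuration produced by the previous step.
STEP_GAINS_PCT = {
    "Parallelism": 28.66,
    "Tool Call Configuration": 13.94,
    "Context Length Adjustment": 4.47,
    "Attention Backend": 2.49,
}

# Gains compound multiplicatively: (1 + g1) * (1 + g2) * ... - 1.
compound = math.prod(1 + g / 100 for g in STEP_GAINS_PCT.values())
compound_pct = (compound - 1) * 100

# Direct ratio of the final ShareGPT throughput to the vLLM baseline.
direct_pct = (8968.32 / 5713.95 - 1) * 100

print(f"compounded steps: {compound_pct:+.2f}%")
print(f"direct ratio:     {direct_pct:+.2f}%")
print(f"naive sum:        {sum(STEP_GAINS_PCT.values()):+.2f}%")
```

The naive sum understates the total because each later gain applies to an already improved throughput.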
Other Benchmark Cases
We further benchmarked the optimized configuration to evaluate how well it generalizes across workloads.
Baseline serving script
vllm serve deepseek-ai/DeepSeek-V3.2 -tp 8 --tokenizer-mode deepseek_v32 --reasoning-parser deepseek_v3
Baseline benchmark results
# random 128 input
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 13.11
Total input tokens: 128000
Total generated tokens: 4000
Request throughput (req/s): 76.30
Output token throughput (tok/s): 305.20
Peak output token throughput (tok/s): 1029.00
Peak concurrent requests: 1000.00
Total Token throughput (tok/s): 10071.49
---------------Time to First Token----------------
Mean TTFT (ms): 5786.15
Median TTFT (ms): 4666.02
P99 TTFT (ms): 12922.46
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 781.80
Median TPOT (ms): 687.63
P99 TPOT (ms): 1637.52
---------------Inter-token Latency----------------
Mean ITL (ms): 586.79
Median ITL (ms): 664.73
P99 ITL (ms): 3545.51
==================================================
# random 2K input
============ Serving Benchmark Result ============
Successful requests: 500
Benchmark duration (s): 96.10
Total input tokens: 1000000
Total generated tokens: 50000
Request throughput (req/s): 5.20
Output token throughput (tok/s): 520.27
Peak output token throughput (tok/s): 4085.00
Peak concurrent requests: 500.00
Total Token throughput (tok/s): 10925.59
---------------Time to First Token----------------
Mean TTFT (ms): 45667.57
Median TTFT (ms): 43997.53
P99 TTFT (ms): 87384.57
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 354.59
Median TPOT (ms): 433.59
P99 TPOT (ms): 452.30
---------------Inter-token Latency----------------
Mean ITL (ms): 353.26
Median ITL (ms): 68.59
P99 ITL (ms): 661.87
==================================================
# random 4K input
============ Serving Benchmark Result ============
Successful requests: 500
Benchmark duration (s): 210.54
Total input tokens: 2000000
Total generated tokens: 100000
Request throughput (req/s): 2.37
Output token throughput (tok/s): 474.96
Peak output token throughput (tok/s): 2769.00
Peak concurrent requests: 500.00
Total Token throughput (tok/s): 9974.26
---------------Time to First Token----------------
Mean TTFT (ms): 103386.88
Median TTFT (ms): 97324.68
P99 TTFT (ms): 200514.23
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 226.74
Median TPOT (ms): 246.98
P99 TPOT (ms): 275.68
---------------Inter-token Latency----------------
Mean ITL (ms): 228.79
Median ITL (ms): 49.50
P99 ITL (ms): 683.63
==================================================
# random 32k input
============ Serving Benchmark Result ============
Successful requests: 100
Benchmark duration (s): 330.61
Total input tokens: 3200000
Total generated tokens: 10000
Request throughput (req/s): 0.30
Output token throughput (tok/s): 30.25
Peak output token throughput (tok/s): 384.00
Peak concurrent requests: 100.00
Total Token throughput (tok/s): 9709.27
---------------Time to First Token----------------
Mean TTFT (ms): 164003.81
Median TTFT (ms): 164174.93
P99 TTFT (ms): 325134.01
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 472.52
Median TPOT (ms): 512.36
P99 TPOT (ms): 514.39
---------------Inter-token Latency----------------
Mean ITL (ms): 477.69
Median ITL (ms): 741.90
P99 ITL (ms): 949.57
==================================================
# 1k random input + 2k generation
============ Serving Benchmark Result ============
Successful requests: 100
Benchmark duration (s): 96.39
Total input tokens: 100000
Total generated tokens: 200000
Request throughput (req/s): 1.04
Output token throughput (tok/s): 2075.01
Peak output token throughput (tok/s): 2598.00
Peak concurrent requests: 100.00
Total Token throughput (tok/s): 3112.52
---------------Time to First Token----------------
Mean TTFT (ms): 4789.20
Median TTFT (ms): 4590.83
P99 TTFT (ms): 10608.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 45.72
Median TPOT (ms): 45.83
P99 TPOT (ms): 47.44
---------------Inter-token Latency----------------
Mean ITL (ms): 46.34
Median ITL (ms): 43.13
P99 ITL (ms): 44.29
==================================================
Optimized serving script
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2 \
--chat-template ./tool_chat_template_deepseekv32.jinja \
--tp-size 8 --dp-size 8 --enable-dp-attention \
--reasoning-parser deepseek-v3 \
--tool-call-parser deepseekv32 \
--context-length=32768 \
--attention-backend nsa \
--nsa-prefill-backend fa3 \
--nsa-decode-backend fa3
Optimized benchmark results
# random 128 input
============ Serving Benchmark Result ============
Successful requests: 984
Benchmark duration (s): 7.13
Total input tokens: 125952
Total generated tokens: 3936
Request throughput (req/s): 138.09
Output token throughput (tok/s): 552.34
Peak output token throughput (tok/s): 2526.00
Peak concurrent requests: 984.00
Total Token throughput (tok/s): 18227.38
---------------Time to First Token----------------
Mean TTFT (ms): 4441.81
Median TTFT (ms): 4685.89
P99 TTFT (ms): 6735.22
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 776.55
Median TPOT (ms): 714.97
P99 TPOT (ms): 1659.07
---------------Inter-token Latency----------------
Mean ITL (ms): 465.93
Median ITL (ms): 59.97
P99 ITL (ms): 4564.16
==================================================
# random 2K input
============ Serving Benchmark Result ============
Successful requests: 500
Benchmark duration (s): 37.89
Total input tokens: 1000000
Total generated tokens: 50000
Request throughput (req/s): 13.20
Output token throughput (tok/s): 1319.64
Peak output token throughput (tok/s): 9477.00
Peak concurrent requests: 500.00
Total Token throughput (tok/s): 27712.54
---------------Time to First Token----------------
Mean TTFT (ms): 18696.78
Median TTFT (ms): 18939.89
P99 TTFT (ms): 32138.14
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 192.24
Median TPOT (ms): 189.20
P99 TPOT (ms): 349.96
---------------Inter-token Latency----------------
Mean ITL (ms): 189.18
Median ITL (ms): 55.05
P99 ITL (ms): 219.74
==================================================
# random 4K input
============ Serving Benchmark Result ============
Successful requests: 500
Benchmark duration (s): 102.21
Total input tokens: 2000000
Total generated tokens: 100000
Request throughput (req/s): 4.89
Output token throughput (tok/s): 978.37
Peak output token throughput (tok/s): 7180.00
Peak concurrent requests: 500.00
Total Token throughput (tok/s): 20545.67
---------------Time to First Token----------------
Mean TTFT (ms): 44019.81
Median TTFT (ms): 40806.24
P99 TTFT (ms): 90171.38
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 177.95
Median TPOT (ms): 161.14
P99 TPOT (ms): 321.10
---------------Inter-token Latency----------------
Mean ITL (ms): 177.57
Median ITL (ms): 49.55
P99 ITL (ms): 224.52
==================================================
# random 32k input
============ Serving Benchmark Result ============
Successful requests: 100
Benchmark duration (s): 160.14
Total input tokens: 3200000
Total generated tokens: 10000
Request throughput (req/s): 0.62
Output token throughput (tok/s): 62.45
Peak output token throughput (tok/s): 1189.00
Peak concurrent requests: 100.00
Total Token throughput (tok/s): 20045.18
---------------Time to First Token----------------
Mean TTFT (ms): 83964.72
Median TTFT (ms): 88265.70
P99 TTFT (ms): 157005.30
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 246.26
Median TPOT (ms): 229.67
P99 TPOT (ms): 514.42
---------------Inter-token Latency----------------
Mean ITL (ms): 244.63
Median ITL (ms): 33.21
P99 ITL (ms): 898.58
==================================================
# 1k random input + 2k generation
============ Serving Benchmark Result ============
Successful requests: 100
Benchmark duration (s): 80.99
Total input tokens: 100000
Total generated tokens: 200000
Request throughput (req/s): 1.23
Output token throughput (tok/s): 2469.32
Peak output token throughput (tok/s): 2800.00
Peak concurrent requests: 100.00
Total Token throughput (tok/s): 3703.98
---------------Time to First Token----------------
Mean TTFT (ms): 2103.54
Median TTFT (ms): 2216.52
P99 TTFT (ms): 3514.77
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 39.45
Median TPOT (ms): 39.40
P99 TPOT (ms): 40.17
---------------Inter-token Latency----------------
Mean ITL (ms): 40.56
Median ITL (ms): 38.90
P99 ITL (ms): 43.63
==================================================