Optimizing GPT-OSS-120B Throughput on NVIDIA H100 GPUs

Conclusion

Recommended configuration for optimizing throughput of GPT-OSS-120B on H100 GPUs:

Serving Command
vllm serve openai/gpt-oss-120b --async-scheduling -tp 2

Comparison of benchmark results before and after optimization:

| Benchmark Case | Baseline Total TPS | Baseline Mean TPOT (ms) | Optimized Total TPS (change per GPU) | Optimized Mean TPOT (ms) |
|---|---|---|---|---|
| ShareGPT | 6095.37 | 54.10 | 16042.19 (+31.6%/GPU) | 86.54 |
| Short Prompt | 17059.15 | 304.80 | 31517.47 (-7.6%/GPU) | 183.05 |
| Medium Prompt | 14385.69 | 41.35 | 34083.02 (+18.5%/GPU) | 139.32 |
| Long Prompt | 14247.47 | 31.70 | 34491.13 (+21.0%/GPU) | 164.73 |
| Very Long Prompt | 16136.57 | 23.99 | 32471.01 (+0.6%/GPU) | 220.32 |

The baseline is vLLM without any optimizations.
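
The per-GPU percentages in this comparison appear to come from dividing the optimized total throughput by the number of GPUs (2 for `-tp 2`) and comparing the result with the baseline, which is assumed here to run on a single GPU. A minimal sketch of that arithmetic, where `per_gpu_gain` is a hypothetical helper name and not part of any benchmark tool:

```python
def per_gpu_gain(baseline_tps: float, optimized_tps: float,
                 optimized_gpus: int, baseline_gpus: int = 1) -> float:
    """Return the percentage change in total token throughput per GPU."""
    # Assumption: the baseline ran on a single GPU unless stated otherwise.
    baseline_per_gpu = baseline_tps / baseline_gpus
    optimized_per_gpu = optimized_tps / optimized_gpus
    return (optimized_per_gpu / baseline_per_gpu - 1.0) * 100.0


# ShareGPT case: 6095.37 tok/s on the baseline, 16042.19 tok/s on 2 GPUs.
print(f"{per_gpu_gain(6095.37, 16042.19, optimized_gpus=2):+.1f}%/GPU")  # +31.6%/GPU
```

A negative value, as in the short-prompt case, means that adding the second GPU less than doubled total throughput.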

Note

  1. Our benchmark tests do not cover all possible optimization combinations. For example, we select the inference engine that performs best under its default configuration as the starting point for further tuning. This pruning approach yields a local optimum, which may not be the global optimum.
  2. There are other optimization methods that depend on specific user scenarios, including max batch size, schedule configuration, extended KV cache, CUDA graph, Torch Compile, etc. The conclusions in this document can serve as a starting point for more targeted optimizations.
  3. The tests are conducted on specific hardware and software setups. Advances in the inference engine may lead to new conclusions.

If anything is missing or out of date, please let us know.

Optimization Objective

Achieve high throughput under high-concurrency request scenarios.

Experimental Setup

Model

GPT-OSS-120B

Hardware

NVIDIA H100 GPUs

Engine Version

  • vLLM: v0.10.2
  • SGLang: v0.5.3rc0
  • TensorRT-LLM: v1.0.0

Benchmark Dataset

  1. ShareGPT
  2. Random dataset with varying sequence lengths:
    • Very long prompt: 32000 input tokens, 100 output tokens
    • Long prompt: 4000 input tokens, 200 output tokens
    • Medium prompt: 2000 input tokens, 100 output tokens
    • Short prompt: 128 input tokens, 4 output tokens

Benchmark Script

We use the vLLM bench CLI tool to benchmark model performance. The following commands are used to run the benchmarks:

# Prepare the ShareGPT dataset
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# Benchmark on ShareGPT dataset
vllm bench serve --model openai/gpt-oss-120b --backend openai-chat --endpoint /v1/chat/completions --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000

# Benchmark on random dataset (fixed seed for reproducibility)
vllm bench serve --model openai/gpt-oss-120b --backend openai-chat --endpoint /v1/chat/completions --dataset-name random --random-input-len 4000 --random-output-len 200 --num-prompts 500 --seed 42
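
The random-dataset command above covers the long-prompt case only. The other sequence lengths can presumably be run by changing `--random-input-len`, `--random-output-len`, and `--num-prompts`. The sketch below generates one command line per workload; the prompt counts are taken from the "Successful requests" lines of the reported results, and reusing seed 42 for every workload is an assumption. `random_bench_command` is a hypothetical helper name.

```python
import shlex

# (label, input tokens, output tokens, number of prompts)
RANDOM_WORKLOADS = [
    ("very long", 32000, 100, 100),
    ("long", 4000, 200, 500),
    ("medium", 2000, 100, 500),
    ("short", 128, 4, 1000),
]


def random_bench_command(input_len: int, output_len: int, num_prompts: int,
                         seed: int = 42) -> str:
    """Build a `vllm bench serve` command line for one random workload."""
    args = [
        "vllm", "bench", "serve",
        "--model", "openai/gpt-oss-120b",
        "--backend", "openai-chat",
        "--endpoint", "/v1/chat/completions",
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--num-prompts", str(num_prompts),
        "--seed", str(seed),  # assumption: same seed for every workload
    ]
    return shlex.join(args)


for label, input_len, output_len, num_prompts in RANDOM_WORKLOADS:
    print(f"# {label} prompt")
    print(random_bench_command(input_len, output_len, num_prompts))
```

The script only prints the commands, so they can be reviewed before being run against a live server.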

Experiment Results

1. Choosing the Inference Engine

vLLM

Serving script
vllm serve openai/gpt-oss-120b --max-model-len 32768
Benchmark result
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  67.40
Total input tokens:                      215312
Total generated tokens:                  195544
Request throughput (req/s):              14.84
Output token throughput (tok/s):         2901.05
Total Token throughput (tok/s):          6095.37
---------------Time to First Token----------------
Mean TTFT (ms):                          24830.45
Median TTFT (ms):                        25973.43
P99 TTFT (ms):                           54033.43
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          54.10
Median TPOT (ms):                        47.57
P99 TPOT (ms):                           193.25
---------------Inter-token Latency----------------
Mean ITL (ms):                           46.65
Median ITL (ms):                         37.98
P99 ITL (ms):                            121.49
==================================================
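
To compare runs programmatically, a result block like the one above can be parsed into a dictionary. The sketch below uses a hypothetical helper, `parse_benchmark_result`, on an excerpt of the vLLM result. It also checks that the reported total token throughput is close to (input tokens + generated tokens) / duration, which is how the figure appears to be derived; the small difference is consistent with the duration being rounded to two decimals.

```python
import re

# Excerpt of the vLLM ShareGPT result shown above.
SAMPLE = """\
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  67.40
Total input tokens:                      215312
Total generated tokens:                  195544
Total Token throughput (tok/s):          6095.37
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          54.10
"""


def parse_benchmark_result(text: str) -> dict[str, float]:
    """Map each 'Name: value' line of a result block to a float."""
    metrics = {}
    for line in text.splitlines():
        match = re.match(r"^(.+?):\s+([0-9.]+)\s*$", line)
        if match:  # separator and section-title lines do not match
            metrics[match.group(1).strip()] = float(match.group(2))
    return metrics


metrics = parse_benchmark_result(SAMPLE)
total_tokens = metrics["Total input tokens"] + metrics["Total generated tokens"]
derived_tps = total_tokens / metrics["Benchmark duration (s)"]
reported_tps = metrics["Total Token throughput (tok/s)"]
print(f"reported {reported_tps:.2f} tok/s, derived {derived_tps:.2f} tok/s")
```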

SGLang

Serving script
python3 -m sglang.launch_server --model-path openai/gpt-oss-120b --host 0.0.0.0 --port 8000 
Benchmark result
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  227.00
Total input tokens:                      215312
Total generated tokens:                  196540
Request throughput (req/s):              4.41
Output token throughput (tok/s):         865.81
Total Token throughput (tok/s):          1814.32
---------------Time to First Token----------------
Mean TTFT (ms):                          132612.71
Median TTFT (ms):                        139831.95
P99 TTFT (ms):                           216875.56
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          57.34
Median TPOT (ms):                        39.11
P99 TPOT (ms):                           345.30
---------------Inter-token Latency----------------
Mean ITL (ms):                           51.23
Median ITL (ms):                         21.25
P99 ITL (ms):                            1328.36
==================================================

TensorRT-LLM

Serving script
trtllm-serve openai/gpt-oss-120b --max_seq_len 32768
Benchmark result
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  191.25
Total input tokens:                      215312
Total generated tokens:                  193874
Request throughput (req/s):              5.23
Output token throughput (tok/s):         1013.72
Total Token throughput (tok/s):          2139.54
---------------Time to First Token----------------
Mean TTFT (ms):                          55896.64
Median TTFT (ms):                        49891.09
P99 TTFT (ms):                           138440.71
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          233.20
Median TPOT (ms):                        213.25
P99 TPOT (ms):                           790.11
---------------Inter-token Latency----------------
Mean ITL (ms):                           101.13
Median ITL (ms):                         53.81
P99 ITL (ms):                            389.65
==================================================

Result: vLLM (6095.37 tok/s) > TensorRT-LLM (2139.54 tok/s) > SGLang (1814.32 tok/s)
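
The size of the gap between engines is easier to see as a ratio. A short sketch using the three total token throughput figures reported above; the variable names are illustrative only:

```python
# Total token throughput (tok/s) on ShareGPT with each engine's default settings.
engine_tps = {"vLLM": 6095.37, "SGLang": 1814.32, "TensorRT-LLM": 2139.54}

# Rank engines from highest to lowest throughput.
ranked = sorted(engine_tps.items(), key=lambda item: item[1], reverse=True)
best_name, best_tps = ranked[0]

for name, tps in ranked[1:]:
    print(f"{best_name} delivers {best_tps / tps:.2f}x the throughput of {name}")
```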

2. Async scheduling in vLLM

Serving script
vllm serve openai/gpt-oss-120b --max-model-len 32768 --async-scheduling
Benchmark result
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  63.65
Total input tokens:                      215312
Total generated tokens:                  187289
Request throughput (req/s):              15.71
Output token throughput (tok/s):         2942.57
Total Token throughput (tok/s):          6325.41
---------------Time to First Token----------------
Mean TTFT (ms):                          23629.95
Median TTFT (ms):                        24595.25
P99 TTFT (ms):                           51018.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          52.66
Median TPOT (ms):                        46.51
P99 TPOT (ms):                           182.00
---------------Inter-token Latency----------------
Mean ITL (ms):                           45.31
Median ITL (ms):                         36.40
P99 ITL (ms):                            111.51
==================================================

Result: Enabling async scheduling improved throughput from 6095.37 tok/s to 6325.41 tok/s (+3.8%).

3. Parallelism in vLLM

TP2

Serving script
vllm serve openai/gpt-oss-120b --max-model-len 32768 --async-scheduling -tp 2
Benchmark result
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  25.65
Total input tokens:                      215312
Total generated tokens:                  196205
Request throughput (req/s):              38.98
Output token throughput (tok/s):         7648.67
Total Token throughput (tok/s):          16042.19
---------------Time to First Token----------------
Mean TTFT (ms):                          4612.68
Median TTFT (ms):                        4600.39
P99 TTFT (ms):                           7587.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          86.54
Median TPOT (ms):                        51.57
P99 TPOT (ms):                           215.21
---------------Inter-token Latency----------------
Mean ITL (ms):                           43.02
Median ITL (ms):                         31.37
P99 ITL (ms):                            223.98
==================================================

TP4

Serving script
vllm serve openai/gpt-oss-120b --max-model-len 32768 --async-scheduling -tp 4
Benchmark result
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  17.32
Total input tokens:                      215312
Total generated tokens:                  196174
Request throughput (req/s):              57.74
Output token throughput (tok/s):         11327.22
Total Token throughput (tok/s):          23759.49
---------------Time to First Token----------------
Mean TTFT (ms):                          2952.04
Median TTFT (ms):                        2871.45
P99 TTFT (ms):                           4867.46
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          55.77
Median TPOT (ms):                        34.49
P99 TPOT (ms):                           136.33
---------------Inter-token Latency----------------
Mean ITL (ms):                           28.57
Median ITL (ms):                         21.75
P99 ITL (ms):                            149.71
==================================================

PP2

Serving script
vllm serve openai/gpt-oss-120b --max-model-len 32768 --async-scheduling -pp 2
Benchmark result
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  36.74
Total input tokens:                      215312
Total generated tokens:                  196089
Request throughput (req/s):              27.22
Output token throughput (tok/s):         5337.65
Total Token throughput (tok/s):          11198.56
---------------Time to First Token----------------
Mean TTFT (ms):                          4471.76
Median TTFT (ms):                        4392.85
P99 TTFT (ms):                           6960.63
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          115.37
Median TPOT (ms):                        69.62
P99 TPOT (ms):                           371.01
---------------Inter-token Latency----------------
Mean ITL (ms):                           60.87
Median ITL (ms):                         50.83
P99 ITL (ms):                            374.02
==================================================

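Result: TP2 improves total throughput from 6325.41 tok/s to 16042.19 tok/s, which is 8021.10 tok/s per GPU, or 26.8% higher token throughput per GPU. TP4 reaches the highest total throughput (23759.49 tok/s) but only 5939.87 tok/s per GPU, and PP2 reaches 5599.28 tok/s per GPU, so both fall below the 6325.41 tok/s of the non-parallel run.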
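
Because the parallel configurations use different numbers of GPUs, they are compared on throughput per GPU, not on total throughput. A minimal sketch of that comparison, assuming the run without parallelism used a single GPU and that TP2, TP4, and PP2 used 2, 4, and 2 GPUs respectively:

```python
# (total token throughput in tok/s, number of GPUs) for each ShareGPT run.
runs = {
    "1 GPU + async scheduling": (6325.41, 1),  # assumption: single GPU
    "TP2": (16042.19, 2),
    "TP4": (23759.49, 4),
    "PP2": (11198.56, 2),
}

reference_tps, reference_gpus = runs["1 GPU + async scheduling"]
reference_per_gpu = reference_tps / reference_gpus

# Normalize every run to tokens per second per GPU.
per_gpu = {name: tps / gpus for name, (tps, gpus) in runs.items()}
changes = {name: (value / reference_per_gpu - 1.0) * 100.0
           for name, value in per_gpu.items()}

for name, value in per_gpu.items():
    print(f"{name:<26} {value:>8.2f} tok/s per GPU ({changes[name]:+.1f}%)")

best = max(per_gpu, key=per_gpu.get)
print(f"Best per-GPU throughput: {best}")
```

On these figures TP4 has the highest total throughput, so it may still be preferable when latency or absolute capacity matters more than GPU efficiency.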

4. Attention Backend in vLLM

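FlashAttention is the default attention backend. Both alternatives tested below failed at startup because they do not support the attention sinks that this model uses.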

FlashInfer

Serving script
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve openai/gpt-oss-120b --max-model-len 32768
Benchmark result
RuntimeError: Worker failed with error 'FlashInfer backend currently does not support attention sinks, please use trtllm on blackwell or flash attention on earlier GPUs.', please check the stack trace above for the root cause

XFormers

Serving script
VLLM_ATTENTION_BACKEND=XFORMERS vllm serve openai/gpt-oss-120b --max-model-len 32768
Benchmark result
TypeError: XFormersImpl.__init__() got an unexpected keyword argument 'sinks'

5. Max Number of Batched Tokens in vLLM

Serving script
vllm serve openai/gpt-oss-120b --max-model-len 32768 --async-scheduling --max-num-batched-tokens 16384
Benchmark result
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  65.28
Total input tokens:                      215312
Total generated tokens:                  187636
Request throughput (req/s):              15.32
Output token throughput (tok/s):         2874.47
Total Token throughput (tok/s):          6172.93
---------------Time to First Token----------------
Mean TTFT (ms):                          23977.21
Median TTFT (ms):                        24694.18
P99 TTFT (ms):                           51339.38
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.18
Median TPOT (ms):                        45.25
P99 TPOT (ms):                           164.63
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.26
Median ITL (ms):                         35.89
P99 ITL (ms):                            109.05
==================================================

Result: Throughput decreased slightly, from 6325.41 tok/s to 6172.93 tok/s (-2.4%), when increasing max-num-batched-tokens to 16384.

Summary of Optimization Options

| Optimization Option | Throughput Improvement |
|---|---|
| Engine Selection | - |
| Async Scheduling | +3.8% |
| Parallelism | +26.8% (per GPU) |
| Attention Backend | - |
| Max Number of Batched Tokens | - |

Other Benchmark Cases

We further benchmarked the optimized configuration to evaluate its generalization under various workloads.

Baseline serving script
vllm serve openai/gpt-oss-120b --max-model-len 32768
Baseline benchmark results
# random 32K input
============ Serving Benchmark Result ============
Successful requests:                     100
Benchmark duration (s):                  198.92
Total input tokens:                      3200000
Total generated tokens:                  9877
Request throughput (req/s):              0.50
Output token throughput (tok/s):         49.65
Total Token throughput (tok/s):          16136.57
---------------Time to First Token----------------
Mean TTFT (ms):                          99444.04
Median TTFT (ms):                        99439.56
P99 TTFT (ms):                           196214.22
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          23.99
Median TPOT (ms):                        23.69
P99 TPOT (ms):                           31.32
---------------Inter-token Latency----------------
Mean ITL (ms):                           25.13
Median ITL (ms):                         7.26
P99 ITL (ms):                            454.68
==================================================

# random 4K input
============ Serving Benchmark Result ============
Successful requests:                     500
Benchmark duration (s):                  145.04
Total input tokens:                      1999249
Total generated tokens:                  67205
Request throughput (req/s):              3.45
Output token throughput (tok/s):         463.35
Total Token throughput (tok/s):          14247.47
---------------Time to First Token----------------
Mean TTFT (ms):                          72954.68
Median TTFT (ms):                        73023.24
P99 TTFT (ms):                           141744.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          31.70
Median TPOT (ms):                        32.02
P99 TPOT (ms):                           37.43
---------------Inter-token Latency----------------
Mean ITL (ms):                           33.25
Median ITL (ms):                         12.37
P99 ITL (ms):                            191.40
==================================================

# random 2K input
============ Serving Benchmark Result ============
Successful requests:                     500
Benchmark duration (s):                  72.87
Total input tokens:                      999120
Total generated tokens:                  49197
Request throughput (req/s):              6.86
Output token throughput (tok/s):         675.11
Total Token throughput (tok/s):          14385.69
---------------Time to First Token----------------
Mean TTFT (ms):                          36234.76
Median TTFT (ms):                        35662.58
P99 TTFT (ms):                           70634.63
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          41.35
Median TPOT (ms):                        42.86
P99 TPOT (ms):                           48.36
---------------Inter-token Latency----------------
Mean ITL (ms):                           43.61
Median ITL (ms):                         14.92
P99 ITL (ms):                            309.43
==================================================

# random 128 input
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  7.72
Total input tokens:                      127755
Total generated tokens:                  4000
Request throughput (req/s):              129.48
Output token throughput (tok/s):         517.91
Total Token throughput (tok/s):          17059.15
---------------Time to First Token----------------
Mean TTFT (ms):                          4850.46
Median TTFT (ms):                        4987.41
P99 TTFT (ms):                           7372.98
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          304.80
Median TPOT (ms):                        341.94
P99 TPOT (ms):                           360.40
---------------Inter-token Latency----------------
Mean ITL (ms):                           467.00
Median ITL (ms):                         352.58
P99 ITL (ms):                            728.75
==================================================

# ShareGPT batch size 4
============ Serving Benchmark Result ============
Successful requests:                     100
Maximum request concurrency:             4
Benchmark duration (s):                  47.88
Total input tokens:                      22946
Total generated tokens:                  21691
Request throughput (req/s):              2.09
Output token throughput (tok/s):         453.04
Total Token throughput (tok/s):          932.29
---------------Time to First Token----------------
Mean TTFT (ms):                          49.00
Median TTFT (ms):                        47.49
P99 TTFT (ms):                           107.46
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          8.31
Median TPOT (ms):                        8.24
P99 TPOT (ms):                           10.12
---------------Inter-token Latency----------------
Mean ITL (ms):                           8.40
Median ITL (ms):                         7.99
P99 ITL (ms):                            23.32
==================================================
Optimized serving script
vllm serve openai/gpt-oss-120b --async-scheduling -tp 2
Optimized benchmark results
# random 32K input
============ Serving Benchmark Result ============
Successful requests:                     100
Benchmark duration (s):                  98.86
Total input tokens:                      3200000
Total generated tokens:                  9955
Request throughput (req/s):              1.01
Output token throughput (tok/s):         100.70
Total Token throughput (tok/s):          32471.01
---------------Time to First Token----------------
Mean TTFT (ms):                          48826.05
Median TTFT (ms):                        48671.00
P99 TTFT (ms):                           96651.74
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          220.32
Median TPOT (ms):                        252.27
P99 TPOT (ms):                           252.79
---------------Inter-token Latency----------------
Mean ITL (ms):                           228.81
Median ITL (ms):                         246.94
P99 ITL (ms):                            481.43
==================================================

# random 4K input
============ Serving Benchmark Result ============
Successful requests:                     500
Benchmark duration (s):                  59.94
Total input tokens:                      1999249
Total generated tokens:                  68046
Request throughput (req/s):              8.34
Output token throughput (tok/s):         1135.29
Total Token throughput (tok/s):          34491.13
---------------Time to First Token----------------
Mean TTFT (ms):                          29487.34
Median TTFT (ms):                        29235.61
P99 TTFT (ms):                           56201.04
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          164.73
Median TPOT (ms):                        199.03
P99 TPOT (ms):                           220.23
---------------Inter-token Latency----------------
Mean ITL (ms):                           171.11
Median ITL (ms):                         215.59
P99 ITL (ms):                            431.23
==================================================

# random 2K input
============ Serving Benchmark Result ============
Successful requests:                     500
Benchmark duration (s):                  30.75
Total input tokens:                      999120
Total generated tokens:                  49059
Request throughput (req/s):              16.26
Output token throughput (tok/s):         1595.22
Total Token throughput (tok/s):          34083.02
---------------Time to First Token----------------
Mean TTFT (ms):                          15291.77
Median TTFT (ms):                        15170.31
P99 TTFT (ms):                           28437.70
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          139.32
Median TPOT (ms):                        148.04
P99 TPOT (ms):                           212.96
---------------Inter-token Latency----------------
Mean ITL (ms):                           147.37
Median ITL (ms):                         208.40
P99 ITL (ms):                            421.37
==================================================

# random 128 input
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  4.18
Total input tokens:                      127755
Total generated tokens:                  4000
Request throughput (req/s):              239.21
Output token throughput (tok/s):         956.85
Total Token throughput (tok/s):          31517.47
---------------Time to First Token----------------
Mean TTFT (ms):                          2800.15
Median TTFT (ms):                        2412.28
P99 TTFT (ms):                           3856.95
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          183.05
Median TPOT (ms):                        208.66
P99 TPOT (ms):                           232.15
---------------Inter-token Latency----------------
Mean ITL (ms):                           274.85
Median ITL (ms):                         267.83
P99 ITL (ms):                            506.27
==================================================

# ShareGPT batch size 4
============ Serving Benchmark Result ============
Successful requests:                     100
Maximum request concurrency:             4
Benchmark duration (s):                  33.60
Total input tokens:                      22946
Total generated tokens:                  21691
Request throughput (req/s):              2.98
Output token throughput (tok/s):         645.55
Total Token throughput (tok/s):          1328.44
---------------Time to First Token----------------
Mean TTFT (ms):                          37.51
Median TTFT (ms):                        36.68
P99 TTFT (ms):                           58.54
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          5.75
Median TPOT (ms):                        5.74
P99 TPOT (ms):                           6.91
---------------Inter-token Latency----------------
Mean ITL (ms):                           5.89
Median ITL (ms):                         5.54
P99 ITL (ms):                            23.43
==================================================