# Evaluating LMCache Prefill Acceleration in vLLM
LMCache is an extensible KV cache layer for LLM inference, designed to address key challenges in large-scale deployment. This document evaluates its performance impact on vLLM inference, focusing on prefill-stage acceleration and its implications for different workload patterns.
## Conclusions
LMCache provides significant prefill acceleration in scenarios with high cache hit rates, achieving up to +355.3% input TPS and a 58.8% reduction in mean TTFT for long-context (20K-token) multi-turn conversations in our experiments.
**Performance benefits are highly workload-dependent:**

- **Optimal scenarios:** multi-turn conversations with shared prefixes and repeated patterns
- **Suboptimal scenarios:** random inputs with no cache reuse patterns

**Chunk size optimization:** the default chunk size of 256 gave the best results across the tested configurations.

**Cache misses incur overhead:** when no cache reuse occurs, throughput degrades by roughly 3% to 15%, making LMCache most suitable for workloads with predictable prefix patterns.
## Technical Background

### LMCache Overview

LMCache extends vLLM's KV cache capabilities through:

| Component | Description |
| --- | --- |
| CPU Offloading | Extends cache capacity beyond GPU VRAM limits |
| Chunk-based Management | Efficient cache storage and retrieval with configurable chunk sizes |
| Multiple Backends | Support for local storage, Redis, and custom backends such as Mooncake |
| Distributed KV Cache | Shared cache across multiple vLLM instances |

### Key Use Cases

- **Low prefix cache hit rates:** mitigates GPU VRAM limitations and cache eviction issues
- **Distributed cache sharing:** enables cache sharing across multiple vLLM instances
- **PD disaggregation:** supports disaggregated prefill/decode deployment architectures
## Experimental Setup

- **Model:** Qwen3-8B
- **Hardware:** NVIDIA RTX 4090 (24 GB)
- **vLLM version:** v0.10.1.1
- **Benchmark method:** multi-turn conversation benchmark
### Serving Commands

```shell
# Standard vLLM serving
vllm serve Qwen/Qwen3-8B
```

The LMCache-enabled server reads its settings from a YAML config file:

```yaml
# lmcache_config.yaml
chunk_size: 256          # tokens per cache chunk
local_cpu: true          # enable CPU offloading
max_local_cpu_size: 50   # CPU cache budget (GB)
```

```shell
# LMCache-enabled serving (note: no spaces around `=` in the env assignment)
LMCACHE_CONFIG_FILE=/root/lmcache_config.yaml vllm serve /root/Qwen3-8B \
    --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'
```
### Benchmark Scripts

Multi-turn benchmark scripts (see https://github.com/vllm-project/vllm/tree/main/benchmarks/multi_turn), driven by `generate_multi_turn.json`:

```json
{
    "filetype": "generate_conversations",
    "num_conversations": 24,
    "text_files": ["pg1184.txt"],
    "print_stats": false,
    "prompt_input": {
        "num_turns": {"distribution": "uniform", "min": 12, "max": 18},
        "common_prefix_num_tokens": {"distribution": "constant", "value": 500},
        "prefix_num_tokens": {"distribution": "lognormal", "average": 4000, "max": 20000},
        "num_tokens": {"distribution": "uniform", "min": 120, "max": 160}
    },
    "prompt_output": {
        "num_tokens": {"distribution": "uniform", "min": 80, "max": 120}
    }
}
```

```shell
python benchmark_serving_multi_turn.py --model $MODEL_PATH \
    --input-file generate_multi_turn.json \
    --num-clients 10 --max-active-conversations 10
```
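The relative changes reported in the result tables are plain percentage deltas against the no-LMCache baseline; a small helper reproduces them (raw numbers copied from the first results table):

```python
def pct_change(baseline: float, value: float) -> float:
    """Relative change of `value` vs `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0

# Spot-check two figures from the first results table:
print(round(pct_change(5849, 9426), 1))        # input TPS delta
print(round(pct_change(4350.48, 2646.09), 1))  # mean TTFT delta
```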
## Experimental Results

| Configuration | Input TPS | Total TPS | Mean TTFT (ms) | Mean TPOT (ms) |
| --- | --- | --- | --- | --- |
| Without LMCache | 5849 | 5957 | 4350.48 | 48.47 |
| With LMCache | 9426 (+61.2%) | 9592 | 2646.09 (-39.2%) | 30.60 (-36.9%) |

| Configuration | Input TPS | Total TPS | Mean TTFT (ms) | Mean TPOT (ms) |
| --- | --- | --- | --- | --- |
| Without LMCache | 4312.17 | 4335.71 | 5070.52 | 33.91 |
| With LMCache | 7750.60 (+79.7%) | 7792.92 | 2091.00 (-58.8%) | 25.83 (-23.8%) |

| Configuration | Input TPS | Total TPS | Mean TTFT (ms) |
| --- | --- | --- | --- |
| Without LMCache | 7443.2 | 7443.6 | 4658.66 |
| With LMCache | 33887.9 (+355.3%) | 33889.8 | 980.87 |
### Tuning Chunk Size

| Chunk Size | Input TPS | Performance Gain | Mean TTFT (ms) |
| --- | --- | --- | --- |
| 64 | 33820.3 | +354.4% | 985.28 |
| 256 | 33887.9 | +355.3% | 980.87 |
| 1024 | 31634.0 | +325.0% | 1055.69 |
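One plausible (unverified here) explanation for the drop at chunk size 1024: since LMCache stores and retrieves the KV cache in whole chunks, a prefix can only reuse its largest chunk-aligned portion, so coarser chunks waste more tokens at the tail of each prefix. A hypothetical illustration using the benchmark's 4000-token average prefix:

```python
def cacheable_tokens(prefix_len: int, chunk_size: int) -> int:
    """Tokens reusable from cache if hits occur only in whole chunks.

    Illustrative model of chunk-granular reuse; the remainder
    (prefix_len % chunk_size) must be recomputed at prefill time.
    """
    return (prefix_len // chunk_size) * chunk_size

for chunk in (64, 256, 1024):
    print(chunk, cacheable_tokens(4000, chunk))
```

With a 4000-token prefix, chunk size 1024 leaves almost a thousand tokens uncacheable, while 64 and 256 capture nearly the whole prefix, consistent with the small gap between them in the table.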
### Cache Miss Scenarios (Random Dataset)

#### Benchmark Scripts

```shell
vllm bench serve --model Qwen/Qwen3-8B --endpoint-type openai-chat \
    --endpoint /v1/chat/completions --dataset-name random \
    --random-input-len 1024 --random-output-len 128 \
    --num-prompts 100 --seed 40
```

| Metric | Without LMCache | With LMCache | Change |
| --- | --- | --- | --- |
| Output TPS | 579.86 | 561.44 | -3.2% |
| Total TPS | 5212.32 | 5046.72 | -3.2% |
| Mean TTFT (ms) | 8886.36 | 9242.72 | +4.0% |
| Mean TPOT (ms) | 42.08 | 43.47 | +3.3% |

| Metric | Without LMCache | With LMCache | Change |
| --- | --- | --- | --- |
| Output TPS | 77.87 | 66.77 | -14.3% |
| Total TPS | 5060.79 | 4338.96 | -14.3% |
| Mean TTFT (ms) | 80610.70 | 92682.22 | +15.0% |
| Mean TPOT (ms) | 43.33 | 42.27 | -2.4% |

| Metric | Without LMCache | With LMCache | Change |
| --- | --- | --- | --- |
| Output TPS | 22.97 | 21.77 | -5.2% |
| Total TPS | 3698.09 | 3504.41 | -5.2% |
| Mean TTFT (ms) | 277456.13 | 292811.62 | +5.5% |
| Mean TPOT (ms) | 31.68 | 32.80 | +3.5% |
### All VRAM KV Cache Hit Scenarios

| Metric | Without LMCache | With LMCache | Change |
| --- | --- | --- | --- |
| Output TPS | 5954.33 | 5752.71 | -3.3% |
| Total TPS | 53589.01 | 51802.45 | -3.3% |
| Mean TTFT (ms) | 3052.08 | 3247.10 | +6.4% |
| Mean TPOT (ms) | 38.40 | 39.04 | +1.7% |

| Metric | Without LMCache | With LMCache | Change |
| --- | --- | --- | --- |
| Output TPS | 3676.71 | 3656.30 | -0.6% |
| Total TPS | 238986.41 | 237659.44 | -0.6% |
| Mean TTFT (ms) | 5060.41 | 5326.37 | +5.3% |
| Mean TPOT (ms) | 54.37 | 53.86 | -1.0% |

| Metric | Without LMCache | With LMCache | Change |
| --- | --- | --- | --- |
| Output TPS | 2213.12 | 1972.32 | -10.9% |
| Total TPS | 356312.70 | 317543.74 | -10.9% |
| Mean TTFT (ms) | 9649.76 | 10109.51 | +4.8% |
| Mean TPOT (ms) | 87.10 | 94.26 | +8.2% |
### Storage Backend Comparison

| Backend | Cache Miss (s) | Cache Hit (s) | Performance Boost |
| --- | --- | --- | --- |
| lmcache_server | 0.739 | 0.324 | 2.28x |
| Redis | 0.746 | 0.388 | 1.92x |
| Mooncake (TCP) | 0.759 | 0.362 | 2.10x |