Evaluating LMCache Prefill Acceleration in vLLM

LMCache is an extensible KV cache layer for LLM inference, designed to address key challenges in large-scale deployments. This document evaluates the performance impact of LMCache on vLLM inference, focusing on prefill-stage acceleration and its implications for different workload patterns.
Conclusions

LMCache provides significant prefill acceleration in scenarios with high cache hit rates, achieving up to a 355.3% improvement in input TPS and a 58.8% reduction in TTFT for long-context (20K tokens) multi-turn conversations in these experiments.

Performance benefits are highly workload-dependent:
- Optimal scenarios: Multi-turn conversations with shared prefixes and repeated patterns
- Suboptimal scenarios: Random inputs with no cache reuse patterns

Chunk size optimization: The default chunk size of 256 gave the best results in the tested configurations.

Cache miss scenarios incur overhead, showing 3% to 15% performance degradation when no cache reuse occurs, which makes LMCache most suitable for workloads with predictable prefix patterns.
Technical Background

LMCache Overview

LMCache extends vLLM's KV cache capabilities through:

| Component | Description |
| --- | --- |
| CPU Offloading | Extends cache capacity beyond GPU VRAM limits |
| Chunk-based Management | Efficient cache storage and retrieval with configurable chunk sizes |
| Multiple Backends | Support for local storage, Redis, and custom backends like Mooncake |
| Distributed KV Cache | Shared cache across multiple vLLM instances |
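To make chunk-based management more concrete, the following is a minimal toy sketch of prefix matching at chunk granularity. It is not LMCache's implementation: the names `chunk_keys` and `ToyChunkStore` are hypothetical, the stored values are placeholders rather than KV tensors, and it assumes that only full chunks are cached and that each chunk key depends on the entire token prefix before it.

```python
import hashlib
from typing import Dict, List, Sequence


def chunk_keys(tokens: Sequence[int], chunk_size: int) -> List[str]:
    """Return one key per full chunk; each key covers the whole prefix up to that chunk."""
    keys: List[str] = []
    hasher = hashlib.sha256()
    full_len = len(tokens) - len(tokens) % chunk_size  # drop the trailing partial chunk
    for start in range(0, full_len, chunk_size):
        chunk = tokens[start:start + chunk_size]
        hasher.update(",".join(map(str, chunk)).encode())
        hasher.update(b"|")
        keys.append(hasher.copy().hexdigest())  # cumulative hash = prefix-dependent key
    return keys


class ToyChunkStore:
    """Hypothetical in-memory store keyed by prefix hash (illustration only)."""

    def __init__(self, chunk_size: int) -> None:
        self.chunk_size = chunk_size
        self._store: Dict[str, str] = {}

    def put(self, tokens: Sequence[int]) -> None:
        for i, key in enumerate(chunk_keys(tokens, self.chunk_size)):
            # A real cache would hold KV tensors here; a string stands in for them.
            self._store.setdefault(key, f"kv-chunk-{i}")

    def matched_prefix_tokens(self, tokens: Sequence[int]) -> int:
        """Count how many leading tokens are covered by consecutive cached chunks."""
        matched = 0
        for key in chunk_keys(tokens, self.chunk_size):
            if key not in self._store:
                break
            matched += self.chunk_size
        return matched


store = ToyChunkStore(chunk_size=4)
history = list(range(10))               # 10 tokens: 2 full chunks are stored
store.put(history)
follow_up = history + [100, 101, 102]   # next turn re-sends the history plus new tokens
print(store.matched_prefix_tokens(follow_up))  # 8
```

Under these assumptions, a follow-up turn that re-sends the conversation history reuses every full chunk of the shared prefix, while an unrelated prompt matches nothing. This is one way to picture why multi-turn workloads benefit and random inputs do not.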
Key Use Cases

- Low Prefix Cache Hit Rates: Mitigates GPU VRAM limitations and cache eviction issues
- Distributed Cache Sharing: Enables cache sharing across multiple vLLM instances
- PD Disaggregation: Supports disaggregated deployment architectures

Experimental Setup

- Model: Qwen3-8B
- Hardware: NVIDIA RTX 4090 24GB
- vLLM Version: v0.10.1.1
- Benchmark Method: Multi-turn conversation benchmark

Serving Commands

# Standard vLLM serving
vllm serve Qwen/Qwen3-8B
# LMCache-enabled serving
##### lmcache_config.yaml
chunk_size: 256
local_cpu: true
max_local_cpu_size: 50
#####
LMCACHE_CONFIG_FILE=/root/lmcache_config.yaml vllm serve /root/Qwen3-8B \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
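For scripted benchmark runs, the same setup can be generated from Python. The sketch below only writes the three config keys shown above and assembles the command and environment; it does not start the server. The helper names `write_lmcache_config` and `build_launch` are hypothetical, and the YAML is written by hand to avoid a PyYAML dependency.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def write_lmcache_config(path: Path, chunk_size: int = 256, local_cpu: bool = True,
                         max_local_cpu_size: int = 50) -> Path:
    """Write the three settings used in this evaluation as a small YAML file."""
    lines = [
        f"chunk_size: {chunk_size}",
        f"local_cpu: {str(local_cpu).lower()}",
        f"max_local_cpu_size: {max_local_cpu_size}",
    ]
    path.write_text("\n".join(lines) + "\n")
    return path


def build_launch(model: str, config_path: Path) -> Tuple[List[str], Dict[str, str]]:
    """Return the argv list and environment matching the serving command above."""
    env = dict(os.environ, LMCACHE_CONFIG_FILE=str(config_path))
    kv_cfg = {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}
    cmd = ["vllm", "serve", model, "--kv-transfer-config", json.dumps(kv_cfg)]
    return cmd, env


cfg = write_lmcache_config(Path(tempfile.mkdtemp()) / "lmcache_config.yaml")
cmd, env = build_launch("/root/Qwen3-8B", cfg)
print(" ".join(cmd))
```

Passing `cmd` and `env` to `subprocess.run(cmd, env=env)` would then launch the server. Sweeping `chunk_size` only requires rewriting the config file between runs.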
Benchmark Scripts

# Multi-turn bench scripts
# Ref: https://github.com/vllm-project/vllm/tree/main/benchmarks/multi_turn
##### generate_multi_turn.json
{
  "filetype": "generate_conversations",
  "num_conversations": 24,
  "text_files": ["pg1184.txt"],
  "print_stats": false,
  "prompt_input": {
    "num_turns": {
      "distribution": "uniform",
      "min": 12,
      "max": 18
    },
    "common_prefix_num_tokens": {
      "distribution": "constant",
      "value": 500
    },
    "prefix_num_tokens": {
      "distribution": "lognormal",
      "average": 4000,
      "max": 20000
    },
    "num_tokens": {
      "distribution": "uniform",
      "min": 120,
      "max": 160
    }
  },
  "prompt_output": {
    "num_tokens": {
      "distribution": "uniform",
      "min": 80,
      "max": 120
    }
  }
}
#####
python benchmark_serving_multi_turn.py --model $MODEL_PATH --input-file generate_multi_turn.json --num-clients 10 --max-active-conversations 10
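The results that follow report TTFT (time to first token) and TPOT (time per output token). As a rough illustration of how these metrics are commonly derived from per-token arrival times, here is a minimal sketch. The function name is hypothetical, and the exact accounting inside the benchmark script is an assumption, not something this document confirms.

```python
from typing import List, Tuple


def ttft_and_tpot_ms(request_sent: float, token_times: List[float]) -> Tuple[float, float]:
    """Compute TTFT and TPOT in milliseconds.

    request_sent: time the request was sent (seconds).
    token_times:  arrival time of each output token (seconds), in order.
    """
    if not token_times:
        raise ValueError("at least one output token is required")
    # TTFT: delay until the first token arrives; dominated by prefill.
    ttft = token_times[0] - request_sent
    # TPOT: average gap between tokens after the first; dominated by decode.
    if len(token_times) == 1:
        tpot = 0.0
    else:
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft * 1000.0, tpot * 1000.0


ttft, tpot = ttft_and_tpot_ms(0.0, [0.5, 0.54, 0.58, 0.62])
print(round(ttft, 1), round(tpot, 1))  # 500.0 40.0
```

Because prefill work happens before the first token, a KV cache hit mainly shortens TTFT. TPOT can also improve under load when less prefill work competes with decoding.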
Experimental Results

| Configuration | Input TPS | Total TPS | Mean TTFT (ms) | Mean TPOT (ms) |
| --- | --- | --- | --- | --- |
| Without LMCache | 5849 | 5957 | 4350.48 | 48.47 |
| With LMCache | 9426 (+61.2%) | 9592 | 2646.09 (-39.2%) | 30.60 (-36.9%) |

| Configuration | Input TPS | Total TPS | Mean TTFT (ms) | Mean TPOT (ms) |
| --- | --- | --- | --- | --- |
| Without LMCache | 4312.17 | 4335.71 | 5070.52 | 33.91 |
| With LMCache | 7750.60 (+79.7%) | 7792.92 | 2091.00 (-58.8%) | 25.83 (-23.8%) |

| Configuration | Input TPS | Total TPS | Mean TTFT (ms) |
| --- | --- | --- | --- |
| Without LMCache | 7443.2 | 7443.6 | 4658.66 |
| With LMCache | 33887.9 (+355.3%) | 33889.8 | 980.87 |
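The percentages in these results are relative changes against the run without LMCache. A short sketch of that arithmetic, using the figures from the last configuration above (the helper name `pct_change` is only illustrative):

```python
def pct_change(baseline: float, value: float) -> float:
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# (without LMCache, with LMCache), copied from the last configuration above
rows = {
    "Input TPS": (7443.2, 33887.9),
    "Mean TTFT (ms)": (4658.66, 980.87),
}

for name, (baseline, value) in rows.items():
    print(f"{name}: {pct_change(baseline, value):+.1f}%")
# Input TPS: +355.3%
# Mean TTFT (ms): -78.9%
```

The input TPS figure matches the reported +355.3%. That configuration does not list a TTFT percentage, and the same formula gives roughly -78.9% for it.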
Tuning Chunk Size

| Chunk Size | Input TPS | Performance Gain | Mean TTFT (ms) |
| --- | --- | --- | --- |
| 64 | 33820.3 | +354.4% | 985.28 |
| 256 | 33887.9 | +355.3% | 980.87 |
| 1024 | 31634.0 | +325.0% | 1055.69 |
Cache Miss Scenarios (Random Dataset)

Benchmark Scripts

vllm bench serve --model Qwen/Qwen3-8B --endpoint-type openai-chat --endpoint /v1/chat/completions --dataset-name random --random-input-len 1024 --random-output-len 128 --num-prompts 100 --seed 40
| Metric | Without LMCache | With LMCache | Change |
| --- | --- | --- | --- |
| Output TPS | 579.86 | 561.44 | -3.2% |
| Total TPS | 5212.32 | 5046.72 | -3.2% |
| Mean TTFT (ms) | 8886.36 | 9242.72 | +4.0% |
| Mean TPOT (ms) | 42.08 | 43.47 | +3.3% |

| Metric | Without LMCache | With LMCache | Change |
| --- | --- | --- | --- |
| Output TPS | 77.87 | 66.77 | -14.3% |
| Total TPS | 5060.79 | 4338.96 | -14.3% |
| Mean TTFT (ms) | 80610.70 | 92682.22 | +15.0% |
| Mean TPOT (ms) | 43.33 | 42.27 | -2.4% |

| Metric | Without LMCache | With LMCache | Change |
| --- | --- | --- | --- |
| Output TPS | 22.97 | 21.77 | -5.2% |
| Total TPS | 3698.09 | 3504.41 | -5.2% |
| Mean TTFT (ms) | 277456.13 | 292811.62 | +5.5% |
| Mean TPOT (ms) | 31.68 | 32.80 | +3.5% |
All VRAM KV Cache Hit Scenarios

| Metric | Without LMCache | With LMCache | Change |
| --- | --- | --- | --- |
| Output TPS | 5954.33 | 5752.71 | -3.3% |
| Total TPS | 53589.01 | 51802.45 | -3.3% |
| Mean TTFT (ms) | 3052.08 | 3247.10 | +6.4% |
| Mean TPOT (ms) | 38.40 | 39.04 | +1.7% |

| Metric | Without LMCache | With LMCache | Change |
| --- | --- | --- | --- |
| Output TPS | 3676.71 | 3656.30 | -0.6% |
| Total TPS | 238986.41 | 237659.44 | -0.6% |
| Mean TTFT (ms) | 5060.41 | 5326.37 | +5.3% |
| Mean TPOT (ms) | 54.37 | 53.86 | -1.0% |

| Metric | Without LMCache | With LMCache | Change |
| --- | --- | --- | --- |
| Output TPS | 2213.12 | 1972.32 | -10.9% |
| Total TPS | 356312.70 | 317543.74 | -10.9% |
| Mean TTFT (ms) | 9649.76 | 10109.51 | +4.8% |
| Mean TPOT (ms) | 87.10 | 94.26 | +8.2% |
| Backend | Cache Miss (s) | Cache Hit (s) | Performance Boost |
| --- | --- | --- | --- |
| lmcache_server | 0.739 | 0.324 | 2.28x |
| Redis | 0.746 | 0.388 | 1.92x |
| Mooncake (TCP) | 0.759 | 0.362 | 2.10x |
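The performance boost in the backend comparison is the ratio of cache-miss time to cache-hit time. A small sketch that reproduces it from the timings above and also reports the absolute time saved per request (the helper name `speedup` is only illustrative):

```python
from typing import Dict, Tuple

# (cache miss seconds, cache hit seconds), copied from the backend comparison
timings: Dict[str, Tuple[float, float]] = {
    "lmcache_server": (0.739, 0.324),
    "Redis": (0.746, 0.388),
    "Mooncake (TCP)": (0.759, 0.362),
}


def speedup(miss_s: float, hit_s: float) -> float:
    """How many times faster a cache hit is than a cache miss."""
    return miss_s / hit_s


for backend, (miss_s, hit_s) in timings.items():
    saved_ms = (miss_s - hit_s) * 1000.0
    print(f"{backend}: {speedup(miss_s, hit_s):.2f}x, saves about {saved_ms:.0f} ms")
```

Miss times are nearly identical across backends, so the differences in boost come almost entirely from how quickly each backend serves a hit.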