Evaluating LMCache Prefill Acceleration in vLLM

LMCache is an extensible KV cache layer for LLM inference, designed to address key challenges in large-scale deployments. This document evaluates the performance impact of LMCache on vLLM inference, focusing on prefill-stage acceleration and its implications for different workload patterns.

Conclusions

LMCache provides significant prefill acceleration in scenarios with high cache hit rates. In these experiments, long-context (20K-token) multi-turn conversations saw mean TTFT fall by 58.8%, and input TPS rose by up to 355.3% when output was limited to one token.
Performance benefits are highly workload-dependent:
- Optimal scenarios: Multi-turn conversations with shared prefixes and repeated patterns
- Suboptimal scenarios: Random inputs with no cache reuse patterns
Chunk size optimization: the default chunk size of 256 gave the best results among the tested sizes (64, 256, and 1024).
Cache misses incur overhead: with no cache reuse, throughput dropped by about 3% to 14% and mean TTFT rose by about 4% to 15% in the random-dataset tests, so LMCache is best suited to workloads with predictable prefix patterns.

Technical Background

LMCache Overview

LMCache extends vLLM's KV cache capabilities through:

Component	Description
CPU Offloading	Extends cache capacity beyond GPU VRAM limits
Chunk-based Management	Efficient cache storage and retrieval with configurable chunk sizes
Multiple Backends	Support for local storage, Redis, and custom backends like Mooncake
Distributed KV Cache	Shared cache across multiple vLLM instances
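
To make the chunk-based management row more concrete, the following is a minimal sketch of prefix-dependent chunk keys. It is not LMCache's actual implementation: the function names, the SHA-256 rolling hash, and the key format are all assumptions chosen for illustration. It only shows why a cache lookup can reuse whole leading chunks of a prompt and why the chunk size sets the granularity of reuse.

```python
import hashlib
from typing import List


def chunk_keys(token_ids: List[int], chunk_size: int = 256) -> List[str]:
    """Return one key per full chunk; each key depends on all preceding tokens.

    Hypothetical helper for illustration, not an LMCache API.
    """
    keys = []
    running = hashlib.sha256()
    # Only full chunks get a key; a trailing partial chunk is ignored here.
    full_len = len(token_ids) - len(token_ids) % chunk_size
    for start in range(0, full_len, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        running.update(",".join(map(str, chunk)).encode() + b";")
        # hexdigest() does not finalize the hash object, so updates can continue.
        keys.append(running.hexdigest())
    return keys


def reusable_tokens(cached: List[int], new: List[int], chunk_size: int = 256) -> int:
    """Count leading tokens of `new` covered by chunks already stored for `cached`."""
    stored = set(chunk_keys(cached, chunk_size))
    hits = 0
    for key in chunk_keys(new, chunk_size):
        if key not in stored:
            break  # the first differing chunk ends the reusable prefix
        hits += 1
    return hits * chunk_size


if __name__ == "__main__":
    first = list(range(1000))
    second = first[:600] + [9999] * 400  # shares the first 600 tokens
    for size in (64, 256, 1024):
        print(size, reusable_tokens(first, second, size))
```

In this toy model a smaller chunk size recovers more of a shared prefix, at the cost of more keys to store and look up. The measurements in this document are the authority on which size performs best in practice.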

Key Use Cases

Low Prefix Cache Hit Rates: Mitigates GPU VRAM limitations and cache eviction issues
Distributed Cache Sharing: Enables cache sharing across multiple vLLM instances
PD Disaggregation: Supports disaggregated deployment architectures

Experimental Setup

Model: Qwen3-8B
Hardware: NVIDIA RTX 4090 24GB
vLLM Version: v0.10.1.1
Benchmark Method: Multi-turn conversation benchmark

Serving Commands

# Standard vLLM serving
vllm serve Qwen/Qwen3-8B

# LMCache-enabled serving
##### lmcache_config.yaml
chunk_size: 256
local_cpu: true
max_local_cpu_size: 50
#####
LMCACHE_CONFIG_FILE=/root/lmcache_config.yaml vllm serve /root/Qwen3-8B \
--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
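
The JSON passed to --kv-transfer-config is easy to misquote in a shell. As a hedged sketch, the LMCache-enabled command above can be assembled from Python so that the JSON is generated rather than typed. The helper names are hypothetical; the flag, environment variable, and paths are the ones shown above.

```python
import json
import os
import shlex
from typing import Dict, List

# Values copied from the serving command above.
KV_TRANSFER_CONFIG = {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}


def build_serve_command(model_path: str) -> List[str]:
    """Build the argument list for the LMCache-enabled `vllm serve` call."""
    return [
        "vllm", "serve", model_path,
        "--kv-transfer-config", json.dumps(KV_TRANSFER_CONFIG),
    ]


def build_env(config_file: str) -> Dict[str, str]:
    """Copy the current environment and point LMCACHE_CONFIG_FILE at the YAML file."""
    env = dict(os.environ)
    env["LMCACHE_CONFIG_FILE"] = config_file
    return env


if __name__ == "__main__":
    cmd = build_serve_command("/root/Qwen3-8B")
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(cmd))
    # To launch the server instead of printing:
    # subprocess.run(cmd, env=build_env("/root/lmcache_config.yaml"), check=True)
```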

Benchmark Scripts

# Multi-turn bench scripts
# Ref: https://github.com/vllm-project/vllm/tree/main/benchmarks/multi_turn

##### generate_multi_turn.json
{
    "filetype": "generate_conversations",
    "num_conversations": 24,
    "text_files": ["pg1184.txt"],
    "print_stats": false,
    "prompt_input": {
        "num_turns": {
            "distribution": "uniform",
            "min": 12,
            "max": 18
        },
        "common_prefix_num_tokens": {
            "distribution": "constant",
            "value": 500
        },
        "prefix_num_tokens": {
            "distribution": "lognormal",
            "average": 4000,
            "max": 20000
        },
        "num_tokens": {
            "distribution": "uniform",
            "min": 120,
            "max": 160
        }
    },
    "prompt_output": {
        "num_tokens": {
            "distribution": "uniform",
            "min": 80,
            "max": 120
        }
    }
}
#####

python benchmark_serving_multi_turn.py --model $MODEL_PATH --input-file generate_multi_turn.json --num-clients 10 --max-active-conversations 10
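
To see why this workload favors a KV cache, the sketch below estimates how much of each prompt was already processed in an earlier turn. It is a simplified model, not part of the benchmark: it assumes every turn resends the full conversation history, that the per-conversation prefix appears once in the first turn, and that the KV entries for earlier prompts and outputs are still cached. The function name is hypothetical.

```python
from typing import Tuple


def prefix_reuse(prefix: int, user_tokens: int, output_tokens: int, turns: int) -> Tuple[int, int]:
    """Return (total_prompt_tokens, reusable_tokens) summed over all turns."""
    total_prompt = 0
    reusable = 0
    history = 0  # tokens already processed in earlier turns of this conversation
    for turn in range(turns):
        # Assumption: the per-conversation prefix is sent once, in the first turn.
        new_tokens = user_tokens + (prefix if turn == 0 else 0)
        prompt = history + new_tokens
        total_prompt += prompt
        reusable += history
        # The model's answer becomes part of the history for the next turn.
        history = prompt + output_tokens
    return total_prompt, reusable


if __name__ == "__main__":
    # Midpoints of the distributions in generate_multi_turn.json.
    total, reused = prefix_reuse(prefix=4000, user_tokens=140, output_tokens=100, turns=15)
    print(f"reusable share of prompt tokens: {reused / total:.1%}")
```

With the midpoint values from the config (4000-token prefix, 140 input tokens and 100 output tokens per turn, 15 turns), about 93% of all prompt tokens in a conversation are reusable under these assumptions, which is the kind of workload where prefill caching pays off.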

Experimental Results

Multi-turn Conversation Performance

5K Input Tokens

Configuration	Input TPS	Total TPS	Mean TTFT (ms)	Mean TPOT (ms)
Without LMCache	5849	5957	4350.48	48.47
With LMCache	9426 (+61.2%)	9592	2646.09 (-39.2%)	30.60 (-36.9%)

20K Input Tokens

Configuration	Input TPS	Total TPS	Mean TTFT (ms)	Mean TPOT (ms)
Without LMCache	4312.17	4335.71	5070.52	33.91
With LMCache	7750.60 (+79.7%)	7792.92	2091.00 (-58.8%)	25.83 (-23.8%)

20K Input Tokens + 1 Output Token

Configuration	Input TPS	Total TPS	Mean TTFT (ms)
Without LMCache	7443.2	7443.6	4658.66
With LMCache	33887.9 (+355.3%)	33889.8	980.87 (-78.9%)
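
The percentages in these tables are relative changes against the run without LMCache. A minimal sketch of that arithmetic, using values from the tables above (the function name is an illustrative choice):

```python
def pct_change(baseline: float, value: float) -> float:
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


if __name__ == "__main__":
    # 5K input tokens: input TPS and mean TTFT.
    print(f"{pct_change(5849, 9426):+.1f}%")
    print(f"{pct_change(4350.48, 2646.09):+.1f}%")
    # 20K input tokens + 1 output token: input TPS and mean TTFT.
    print(f"{pct_change(7443.2, 33887.9):+.1f}%")
    print(f"{pct_change(4658.66, 980.87):+.1f}%")
```

Applied to the mean TTFT values of the 20K + 1 output token run (4658.66 ms and 980.87 ms), the same function gives a change of about -78.9%.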

Tuning Chunk Size

Chunk Size	Input TPS	Gain vs. Without LMCache	Mean TTFT (ms)
64	33820.3	+354.4%	985.28
256	33887.9	+355.3%	980.87
1024	31634.0	+325.0%	1055.69

Cache Miss Scenarios (Random Dataset)

Benchmark Scripts

vllm bench serve --model Qwen/Qwen3-8B --endpoint-type openai-chat --endpoint /v1/chat/completions --dataset-name random --random-input-len 1024 --random-output-len 128 --num-prompts 100 --seed 40

1K Input Tokens

Metric	Without LMCache	With LMCache	Change
Output TPS	579.86	561.44	-3.2%
Total TPS	5212.32	5046.72	-3.2%
Mean TTFT (ms)	8886.36	9242.72	+4.0%
Mean TPOT (ms)	42.08	43.47	+3.3%

8K Input Tokens

Metric	Without LMCache	With LMCache	Change
Output TPS	77.87	66.77	-14.3%
Total TPS	5060.79	4338.96	-14.3%
Mean TTFT (ms)	80610.70	92682.22	+15.0%
Mean TPOT (ms)	43.33	42.27	-2.4%

20K Input Tokens

Metric	Without LMCache	With LMCache	Change
Output TPS	22.97	21.77	-5.2%
Total TPS	3698.09	3504.41	-5.2%
Mean TTFT (ms)	277456.13	292811.62	+5.5%
Mean TPOT (ms)	31.68	32.80	+3.5%

All VRAM KV Cache Hit Scenarios

1K Input Tokens

Metric	Without LMCache	With LMCache	Change
Output TPS	5954.33	5752.71	-3.3%
Total TPS	53589.01	51802.45	-3.3%
Mean TTFT (ms)	3052.08	3247.10	+6.4%
Mean TPOT (ms)	38.40	39.04	+1.7%

8K Input Tokens

Metric	Without LMCache	With LMCache	Change
Output TPS	3676.71	3656.30	-0.6%
Total TPS	238986.41	237659.44	-0.6%
Mean TTFT (ms)	5060.41	5326.37	+5.3%
Mean TPOT (ms)	54.37	53.86	-0.9%

20K Input Tokens

Metric	Without LMCache	With LMCache	Change
Output TPS	2213.12	1972.32	-10.9%
Total TPS	356312.70	317543.74	-10.9%
Mean TTFT (ms)	9649.76	10109.51	+4.8%
Mean TPOT (ms)	87.10	94.26	+8.2%

Remote Storage Backend Performance (20K Tokens TTFT)

Backend	Cache Miss (s)	Cache Hit (s)	Speedup (Miss / Hit)
lmcache_server	0.739	0.324	2.28x
Redis	0.746	0.388	1.92x
Mooncake (TCP)	0.759	0.362	2.10x
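
The last column is the ratio of cache-miss TTFT to cache-hit TTFT. The sketch below reproduces it and, as a rough planning aid, blends the two latencies for a given hit rate. The hit rate is a hypothetical input, not something measured in this document, and the blended value is a simple weighted average that ignores queueing and batching effects.

```python
def speedup(miss_s: float, hit_s: float) -> float:
    """TTFT speedup of a cache hit over a cache miss."""
    return miss_s / hit_s


def expected_ttft(hit_rate: float, miss_s: float, hit_s: float) -> float:
    """Weighted-average TTFT for an assumed (hypothetical) cache hit rate in [0, 1]."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_s + (1.0 - hit_rate) * miss_s


if __name__ == "__main__":
    # (cache miss seconds, cache hit seconds) from the table above.
    backends = {
        "lmcache_server": (0.739, 0.324),
        "Redis": (0.746, 0.388),
        "Mooncake (TCP)": (0.759, 0.362),
    }
    for name, (miss, hit) in backends.items():
        print(f"{name}: {speedup(miss, hit):.2f}x, "
              f"TTFT at 50% hits = {expected_ttft(0.5, miss, hit):.3f} s")
```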