Evaluating LMCache Prefill Acceleration in vLLM

LMCache is an extensible KV cache layer for LLM inference designed to address key challenges in large-scale deployment scenarios. This document evaluates the performance impact of LMCache on vLLM inference, focusing in particular on prefill-stage acceleration and its implications for different workload patterns.

Conclusions

  1. LMCache provides significant prefill acceleration in scenarios with high cache hit rates, achieving up to a +355.3% input TPS improvement and a 58.8% reduction in TTFT for long-context (20K-token) multi-turn conversations in these experiments.

  2. Performance benefits are highly workload-dependent:
     - Optimal scenarios: multi-turn conversations with shared prefixes and repeated patterns
     - Suboptimal scenarios: random inputs with no cache reuse patterns

  3. Chunk size optimization: the default chunk size of 256 gave the best results among the tested configurations (64, 256, 1024).

  4. Cache miss scenarios incur overhead: throughput drops by roughly 3-15% when no cache reuse occurs, making LMCache most suitable for workloads with predictable prefix patterns.

Technical Background

LMCache Overview

LMCache extends vLLM's KV cache capabilities through:

| Component | Description |
| --- | --- |
| CPU Offloading | Extends cache capacity beyond GPU VRAM limits |
| Chunk-based Management | Efficient cache storage and retrieval with configurable chunk sizes |
| Multiple Backends | Support for local storage, Redis, and custom backends such as Mooncake |
| Distributed KV Cache | Shared cache across multiple vLLM instances |
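
To make chunk-based management concrete, here is a minimal conceptual sketch (not LMCache's actual implementation; the function names and hashing scheme are illustrative) of how lookups at chunk granularity let a new prompt reuse every fully matched prefix chunk:

```python
import hashlib

CHUNK_SIZE = 256  # tokens per chunk, matching the default config used below

def chunk_keys(token_ids: list[int], chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Key each full prefix chunk by a hash of all tokens up to its end."""
    keys = []
    for end in range(chunk_size, len(token_ids) + 1, chunk_size):
        keys.append(hashlib.sha256(str(token_ids[:end]).encode()).hexdigest())
    return keys

def reusable_prefix_len(token_ids: list[int], cache: set[str]) -> int:
    """Count leading tokens whose KV entries can be served from cached chunks."""
    hits = 0
    for key in chunk_keys(token_ids):
        if key not in cache:
            break  # an unseen or partial chunk ends the reusable prefix
        hits += 1
    return hits * CHUNK_SIZE

# A 1000-token request caches 3 full chunks; a follow-up sharing that prefix
# can reuse 768 tokens of KV cache (3 * 256), recomputing only the remainder.
cache = set(chunk_keys(list(range(1000))))
assert reusable_prefix_len(list(range(1200)), cache) == 768
```

This is also why very large chunks trade away reuse: a chunk is only reusable when it matches in full.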

Key Use Cases

  1. Low Prefix Cache Hit Rates: mitigates GPU VRAM limits and cache eviction by offloading KV cache to CPU memory
  2. Distributed Cache Sharing: enables cache sharing across multiple vLLM instances (a sample configuration follows this list)
  3. PD Disaggregation: supports prefill/decode-disaggregated deployment architectures
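
For use case 2, a sketch of a shared-backend setup: every vLLM instance loads a config pointing at the same Redis server. The remote_url and remote_serde keys follow LMCache's configuration format, but the host, port, and values here are placeholders; verify against the LMCache documentation for your version.

```yaml
# lmcache_config.yaml shared by multiple vLLM instances (hypothetical host)
chunk_size: 256
local_cpu: true
max_local_cpu_size: 50
remote_url: "redis://10.0.0.5:6379"   # same Redis backend for every instance
remote_serde: "naive"                 # serialization format for remote chunks
```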

Experimental Setup

  • Model: Qwen3-8B
  • Hardware: NVIDIA RTX 4090 24GB
  • vLLM Version: v0.10.1.1
  • Benchmark Method: Multi-turn conversation benchmark
Serving Commands
```bash
# Standard vLLM serving
vllm serve Qwen/Qwen3-8B
```

lmcache_config.yaml:

```yaml
chunk_size: 256
local_cpu: true
max_local_cpu_size: 50
```

```bash
# LMCache-enabled serving
LMCACHE_CONFIG_FILE=/root/lmcache_config.yaml vllm serve /root/Qwen3-8B \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
```
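
A quick way to observe the effect once the server is up (a sketch assuming the default port 8000 and a synthetic prompt; with max_tokens set to 1, total request time approximates TTFT): send the same long prompt twice and compare timings. The second request should be markedly faster when the prefill is served from cache.

```bash
# ~4K tokens of synthetic prompt text (no quotes, so it embeds safely in JSON)
PROMPT=$(python3 -c "print('The quick brown fox jumps over the lazy dog. ' * 400)")

for i in 1 2; do
  curl -s -o /dev/null -w "request $i: %{time_total}s\n" \
    http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"/root/Qwen3-8B\", \"max_tokens\": 1,
         \"messages\": [{\"role\": \"user\", \"content\": \"$PROMPT\"}]}"
done
```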
Benchmark Scripts
Conversations are generated with the multi-turn benchmark scripts from https://github.com/vllm-project/vllm/tree/main/benchmarks/multi_turn, using the following input file:

generate_multi_turn.json:

```json
{
    "filetype": "generate_conversations",
    "num_conversations": 24,
    "text_files": ["pg1184.txt"],
    "print_stats": false,
    "prompt_input": {
        "num_turns": {
            "distribution": "uniform",
            "min": 12,
            "max": 18
        },
        "common_prefix_num_tokens": {
            "distribution": "constant",
            "value": 500
        },
        "prefix_num_tokens": {
            "distribution": "lognormal",
            "average": 4000,
            "max": 20000
        },
        "num_tokens": {
            "distribution": "uniform",
            "min": 120,
            "max": 160
        }
    },
    "prompt_output": {
        "num_tokens": {
            "distribution": "uniform",
            "min": 80,
            "max": 120
        }
    }
}
```

```bash
python benchmark_serving_multi_turn.py --model $MODEL_PATH \
  --input-file generate_multi_turn.json \
  --num-clients 10 --max-active-conversations 10
```
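
For reference, the latency and throughput metrics in the tables below follow the commonly used definitions sketched here (the benchmark script's exact accounting may differ slightly):

```python
def ttft_ms(request_start_s: float, first_token_s: float) -> float:
    """Time To First Token: wait until the first output token arrives."""
    return (first_token_s - request_start_s) * 1000

def tpot_ms(latency_s: float, ttft_s: float, output_tokens: int) -> float:
    """Time Per Output Token: decode time averaged over tokens after the first."""
    return (latency_s - ttft_s) * 1000 / max(output_tokens - 1, 1)

def input_tps(total_input_tokens: int, wall_time_s: float) -> float:
    """Input (prefill) token throughput across all requests in the run."""
    return total_input_tokens / wall_time_s
```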

Experimental Results

Multi-turn Conversation Performance

5K Input Tokens

| Configuration | Input TPS | Total TPS | Mean TTFT (ms) | Mean TPOT (ms) |
| --- | --- | --- | --- | --- |
| Without LMCache | 5849 | 5957 | 4350.48 | 48.47 |
| With LMCache | 9426 (+61.2%) | 9592 | 2646.09 (-39.2%) | 30.60 (-36.9%) |

20K Input Tokens

| Configuration | Input TPS | Total TPS | Mean TTFT (ms) | Mean TPOT (ms) |
| --- | --- | --- | --- | --- |
| Without LMCache | 4312.17 | 4335.71 | 5070.52 | 33.91 |
| With LMCache | 7750.60 (+79.7%) | 7792.92 | 2091.00 (-58.8%) | 25.83 (-23.8%) |

20K Input Tokens + 1 Output Token

Limiting output to a single token isolates the prefill stage, which is why the measured gain is largest here.

| Configuration | Input TPS | Total TPS | Mean TTFT (ms) |
| --- | --- | --- | --- |
| Without LMCache | 7443.2 | 7443.6 | 4658.66 |
| With LMCache | 33887.9 (+355.3%) | 33889.8 | 980.87 |

Tuning Chunk Size

| Chunk Size | Input TPS | Performance Gain | Mean TTFT (ms) |
| --- | --- | --- | --- |
| 64 | 33820.3 | +354.4% | 985.28 |
| 256 | 33887.9 | +355.3% | 980.87 |
| 1024 | 31634.0 | +325.0% | 1055.69 |

The default chunk size of 256 performs best. The 1024 setting lags noticeably, plausibly because cache reuse happens at chunk granularity: a chunk is only reusable when it matches in full, so larger chunks make prefix matching coarser.

Cache Miss Scenarios (Random Dataset)

Benchmark Scripts

```bash
vllm bench serve --model Qwen/Qwen3-8B \
  --endpoint-type openai-chat --endpoint /v1/chat/completions \
  --dataset-name random --random-input-len 1024 --random-output-len 128 \
  --num-prompts 100 --seed 40
```

(The command shown uses 1K inputs; the 8K and 20K runs below presumably raise --random-input-len accordingly.)

1K Input Tokens

| Metric | Without LMCache | With LMCache | Change |
| --- | --- | --- | --- |
| Output TPS | 579.86 | 561.44 | -3.2% |
| Total TPS | 5212.32 | 5046.72 | -3.2% |
| Mean TTFT (ms) | 8886.36 | 9242.72 | +4.0% |
| Mean TPOT (ms) | 42.08 | 43.47 | +3.3% |

8K Input Tokens

| Metric | Without LMCache | With LMCache | Change |
| --- | --- | --- | --- |
| Output TPS | 77.87 | 66.77 | -14.3% |
| Total TPS | 5060.79 | 4338.96 | -14.3% |
| Mean TTFT (ms) | 80610.70 | 92682.22 | +15.0% |
| Mean TPOT (ms) | 43.33 | 42.27 | -2.4% |

20K Input Tokens

| Metric | Without LMCache | With LMCache | Change |
| --- | --- | --- | --- |
| Output TPS | 22.97 | 21.77 | -5.2% |
| Total TPS | 3698.09 | 3504.41 | -5.2% |
| Mean TTFT (ms) | 277456.13 | 292811.62 | +5.5% |
| Mean TPOT (ms) | 31.68 | 32.80 | +3.5% |

All-VRAM KV Cache Hit Scenarios (prefix cache hits served entirely from GPU VRAM)

1K Input Tokens

| Metric | Without LMCache | With LMCache | Change |
| --- | --- | --- | --- |
| Output TPS | 5954.33 | 5752.71 | -3.3% |
| Total TPS | 53589.01 | 51802.45 | -3.3% |
| Mean TTFT (ms) | 3052.08 | 3247.10 | +6.4% |
| Mean TPOT (ms) | 38.40 | 39.04 | +1.7% |

8K Input Tokens

| Metric | Without LMCache | With LMCache | Change |
| --- | --- | --- | --- |
| Output TPS | 3676.71 | 3656.30 | -0.6% |
| Total TPS | 238986.41 | 237659.44 | -0.6% |
| Mean TTFT (ms) | 5060.41 | 5326.37 | +5.3% |
| Mean TPOT (ms) | 54.37 | 53.86 | -1.0% |

20K Input Tokens

| Metric | Without LMCache | With LMCache | Change |
| --- | --- | --- | --- |
| Output TPS | 2213.12 | 1972.32 | -10.9% |
| Total TPS | 356312.70 | 317543.74 | -10.9% |
| Mean TTFT (ms) | 9649.76 | 10109.51 | +4.8% |
| Mean TPOT (ms) | 87.10 | 94.26 | +8.2% |
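
These regressions are consistent with the cache-miss results above: when vLLM's native GPU prefix cache already serves the hit, LMCache's chunk lookups and offload bookkeeping add per-request work without enabling any extra reuse.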

Remote Storage Backend Performance (20K Tokens TTFT)

| Backend | Cache Miss TTFT (s) | Cache Hit TTFT (s) | Speedup |
| --- | --- | --- | --- |
| lmcache_server | 0.739 | 0.324 | 2.28x |
| Redis | 0.746 | 0.388 | 1.92x |
| Mooncake (TCP) | 0.759 | 0.362 | 2.10x |
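
The backend is selected via the remote_url scheme in the LMCache config. A sketch with placeholder hosts and ports (confirm the exact URL formats in the LMCache documentation for your version):

```yaml
# lmcache_server backend
remote_url: "lm://192.168.1.10:65432"

# Redis backend (alternative; comment the line above and use this instead)
# remote_url: "redis://192.168.1.10:6379"

# Mooncake additionally needs its own store/transfer configuration beyond
# the connector settings; see the LMCache + Mooncake integration docs.
```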