
Ray Serve LLM Performance Benchmarks

Reproducible benchmark snapshots exploring different aspects of Ray Serve LLM performance. Each benchmark includes experiments, tooling, and debugging utilities to help replicate the approach on different workloads, models, and hardware stacks.

Benchmarks

Horizontal Scaling

Demonstrates that Ray Serve scales horizontally with linear throughput gains. When you scale the number of replicas in a deployment from 1x → 2x → 4x → 8x, you see proportional increases in aggregate throughput. Includes:

  • Benchmark methodology accounting for client scaling, symmetric component scaling, and prefix caching considerations
  • Comparison between Ray Serve with open-source Ray and Anyscale runtime optimizations
  • Tools for running benchmarks at different replica counts and concurrency levels
  • Visualization showing consistent per-replica throughput across scaling factors

See horizontal_scaling/README.md to get started; a minimal deployment sketch follows below.
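As a reference point, here is a minimal sketch of the kind of deployment that gets scaled, using Ray Serve LLM's `LLMConfig` and `build_openai_app`. The model, replica count, and parallelism values are placeholders rather than the benchmark's exact configuration; see the directory's README for the real setup.

```python
# Minimal sketch: serve one model behind an OpenAI-compatible endpoint and
# sweep the replica count between runs to measure aggregate throughput.
# Model id/source and replica count are illustrative placeholders.
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

NUM_REPLICAS = 4  # sweep 1 -> 2 -> 4 -> 8 across benchmark runs

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3.1-8b",                       # placeholder model id
        model_source="meta-llama/Llama-3.1-8B-Instruct",
    ),
    deployment_config=dict(
        # Fixed replica count per run so throughput differences come only
        # from the replica count being swept, not from autoscaling.
        num_replicas=NUM_REPLICAS,
    ),
    engine_kwargs=dict(tensor_parallel_size=1),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
```

Between runs, only the replica count changes and the load generators are scaled proportionally (per the client-scaling methodology above), so per-replica throughput can be compared directly across the 1x/2x/4x/8x points.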

Prefill/Decode Disaggregation

Systematic exploration of prefill/decode (PD) disaggregation performance on real hardware (GPT-OSS-120B on 2× p5.48xlarge instances with H100 GPUs and EFA networking). Includes:

  • Three narrative experiments showing when PD helps and configuration trade-offs
  • Working NIXL + UCX + EFA setup for multi-node KV cache transfer
  • Tools for benchmarking, visualization, and debugging

See prefill_decode/README.md to get started; a configuration sketch follows below.
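To make the moving pieces concrete, here is a minimal sketch of how prefill and decode replicas might each be configured with vLLM's NIXL KV connector through `engine_kwargs`. The field names follow vLLM's KV-transfer config and should be treated as assumptions; NIXL rides on UCX, and EFA-specific tuning happens through UCX/NIXL environment variables that are omitted here. Model, replica, and parallelism values are illustrative, and the actual PD router wiring follows prefill_decode/README.md.

```python
# Minimal sketch of separate prefill and decode configurations using vLLM's
# NIXL KV connector for KV cache transfer between replicas. Values are
# placeholders, not the benchmark's exact settings.
from ray.serve.llm import LLMConfig

nixl_kv_transfer = dict(kv_connector="NixlConnector", kv_role="kv_both")

prefill_config = LLMConfig(
    model_loading_config=dict(model_id="gpt-oss-120b-prefill",   # placeholder
                              model_source="openai/gpt-oss-120b"),
    deployment_config=dict(num_replicas=1),
    engine_kwargs=dict(
        tensor_parallel_size=8,
        kv_transfer_config=nixl_kv_transfer,
    ),
)

decode_config = LLMConfig(
    model_loading_config=dict(model_id="gpt-oss-120b-decode",    # placeholder
                              model_source="openai/gpt-oss-120b"),
    deployment_config=dict(num_replicas=1),
    engine_kwargs=dict(
        tensor_parallel_size=8,
        kv_transfer_config=nixl_kv_transfer,
    ),
)

# Wiring these two configs into a prefill/decode router, so decode replicas
# pull KV blocks produced by prefill replicas over NIXL, is covered in
# prefill_decode/README.md.
```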

PD + KV Cache Offloading

Demonstrates composing multiple KV transfer backends using MultiConnector. Shows how to combine NIXL (for GPU-to-GPU transfers) with LMCache (for local offloading) in PD deployments:

  • Four configurations comparing baseline, LMCache-only, NIXL-only, and combined approaches
  • Config-driven YAML deployment pattern with custom builders
  • Performance analysis showing composability overhead

See pd+kv_offloading/README.md for details; a configuration sketch follows below.
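A hedged sketch of what the composed configuration can look like: vLLM's `MultiConnector` takes an ordered list of child connectors, here NIXL for GPU-to-GPU transfer plus LMCache for local offloading. The field names follow vLLM's KV-transfer config and should be treated as assumptions; the repo's config-driven YAML and custom builders express the same idea declaratively.

```python
# Minimal sketch: compose two KV connectors with vLLM's MultiConnector so a
# replica can both transfer KV blocks over NIXL (GPU-to-GPU) and offload them
# locally through LMCache. Values are illustrative, not the repo's exact YAML.
engine_kwargs = dict(
    kv_transfer_config=dict(
        kv_connector="MultiConnector",
        kv_role="kv_both",
        kv_connector_extra_config=dict(
            # Child connectors are consulted in order; each entry is itself a
            # KVTransferConfig-shaped dict.
            connectors=[
                dict(kv_connector="NixlConnector", kv_role="kv_both"),
                dict(kv_connector="LMCacheConnectorV1", kv_role="kv_both"),
            ],
        ),
    ),
)
```

This `engine_kwargs` block would be dropped into an `LLMConfig` like the ones in the earlier sketches; comparing it against the baseline, NIXL-only, and LMCache-only variants is what surfaces the composability overhead.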

Replica Initialization

Measures how fast Ray Serve LLM replicas can cold-start and serve their first request, exploring optimization strategies for autoscaling scenarios. Includes:

  • Experiments across three model sizes (8B, 70B, 235B) testing different optimization strategies
  • Model streaming from S3 to reduce initialization time
  • Torch compile caching to accelerate model compilation
  • Performance visualizations showing initialization time improvements
  • Tools for measuring TTFT (time-to-first-token) including replica initialization overhead

See replica_initialization/README.md for details; a TTFT measurement sketch follows below.
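For flavor, here is a minimal sketch of a TTFT probe against the OpenAI-compatible endpoint, assuming a local Serve deployment at http://localhost:8000/v1 and a placeholder model id. Sent right after a scale-up event, while the new replica is still initializing, the measured TTFT includes the cold-start overhead this benchmark targets.

```python
# Minimal sketch: measure time-to-first-token through the OpenAI-compatible
# endpoint using the standard openai client. Endpoint URL and model id are
# placeholders and must match your deployment.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.1-8b",  # must match the served model_id
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)

ttft = None
for chunk in stream:
    if ttft is None and chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start  # first generated token arrived
        break

print(f"TTFT (including any replica cold-start overhead): {ttft:.2f}s")
```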


Note: This repository is actively evolving. Additional benchmark angles will be added as we explore different Ray Serve LLM optimization strategies.
