
vLLM Engine-Direct Connection Ceiling Benchmark

A benchmark to find the highest in-flight concurrency that a vLLM engine can sustain for long prompts while meeting SLA requirements.

Project Structure

The project is organized into the following modules:

  • main.py - Main entry point that orchestrates the benchmark
  • config.py - Configuration management and command-line argument parsing
  • prompt_builder.py - Utilities for constructing prompts of a target token length (see the sketch after this list)
  • engine_manager.py - Engine initialization and sampling parameter creation
  • benchmark.py - Core benchmarking functions (streaming, concurrency testing, ceiling finding)
  • download_models.py - Utility to download models from Hugging Face to default cache
  • download_qwen_models.py - Batch download script for Qwen models
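
For illustration, here is a minimal sketch of tokenizer-based prompt construction in the spirit of prompt_builder.py. The function name and filler text are assumptions for this example, not the module's actual API:

# Sketch: build a prompt of roughly `target_tokens` tokens by repeating a filler
# sentence and trimming with the model's tokenizer (requires `transformers`).
from transformers import AutoTokenizer

def build_prompt(model_name: str, target_tokens: int) -> str:
    filler = "The quick brown fox jumps over the lazy dog. "
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    filler_ids = tokenizer.encode(filler, add_special_tokens=False)
    # Repeat the filler enough times to cover the target, then trim to length.
    repeats = target_tokens // len(filler_ids) + 1
    token_ids = (filler_ids * repeats)[:target_tokens]
    # Decoding and re-encoding can shift the count by a few tokens; treat the
    # result as approximate.
    return tokenizer.decode(token_ids)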

Usage

python3 main.py \
  --model meta-llama/Llama-3.1-8B \
  --target-input-tokens 5000 \
  --ttft-timeout 60 \
  --hold-seconds 5 \
  --sla-ok-rate 0.99
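
These flags map onto the options defined in config.py. A rough sketch of the corresponding argparse definitions (the help strings and defaults here are illustrative assumptions, not the project's exact values):

# Sketch: command-line options matching the usage example above.
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="vLLM engine-direct concurrency ceiling benchmark")
    parser.add_argument("--model", required=True, help="Hugging Face model name or local path")
    parser.add_argument("--target-input-tokens", type=int, default=5000, help="Approximate prompt length in tokens")
    parser.add_argument("--ttft-timeout", type=float, default=60.0, help="Maximum allowed time-to-first-token, in seconds")
    parser.add_argument("--hold-seconds", type=float, default=5.0, help="How long to hold each concurrency level")
    parser.add_argument("--sla-ok-rate", type=float, default=0.99, help="Minimum fraction of requests that must meet the SLA")
    return parser.parse_args()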

Model Downloads

The project includes utilities to download models from Hugging Face to the default cache location:

Download Qwen Models

# Download both Qwen3-30B-A3B and Qwen3-30B-A3B-FP8 models
make download-qwen-models

# Download a specific model
make download-model MODEL=Qwen/Qwen3-30B-A3B
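
Under the hood, downloading into the default cache can be done with huggingface_hub. A minimal sketch of what download_models.py is assumed to do (the real script may add error handling or arguments):

# Sketch: fetch model snapshots into the default Hugging Face cache.
from huggingface_hub import snapshot_download

for repo_id in ["Qwen/Qwen3-30B-A3B", "Qwen/Qwen3-30B-A3B-FP8"]:
    local_path = snapshot_download(repo_id)  # downloads, or reuses an existing cached copy
    print(f"{repo_id} cached at {local_path}")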

Using Downloaded Models

Once downloaded, you can use the models with their original Hugging Face names:

# Use downloaded models with their original names
python3 main.py --model Qwen/Qwen3-30B-A3B
python3 main.py --model Qwen/Qwen3-30B-A3B-FP8

The models are stored in the default Hugging Face cache location and will be automatically found when referenced by their original model names.
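
To check which models are already cached, huggingface_hub's cache scanner can be used (a general-purpose check, not part of this project):

# Sketch: list cached repositories and their on-disk size.
from huggingface_hub import scan_cache_dir

for repo in scan_cache_dir().repos:
    print(repo.repo_id, f"{repo.size_on_disk / 1e9:.1f} GB")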

Key Features

  • Direct Engine Connection: Avoids HTTP/gRPC overhead by driving vLLM's AsyncLLMEngine directly
  • Concurrency Ceiling Detection: Uses an exponential ramp followed by a binary search to find the maximum sustainable concurrency (see the sketch after this list)
  • SLA Compliance: Checks that the success-rate and time-to-first-token (TTFT) thresholds are met at each concurrency level
  • Flexible Prompt Building: Supports both tokenizer-based and character-based prompt construction
  • GPU Selection: Choose which GPU device to use via the CUDA_VISIBLE_DEVICES environment variable
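
A simplified sketch of the ramp-and-search logic. The passes_sla helper is a hypothetical stand-in for one timed trial; the real benchmark.py drives concurrent streaming requests and applies the TTFT and success-rate thresholds:

# Sketch: exponential ramp to find a failing concurrency level, then binary
# search between the last passing and the first failing level.
def find_ceiling(passes_sla, start: int = 1, max_concurrency: int = 4096) -> int:
    low, high = 0, start
    # Ramp: double the concurrency until the SLA is violated or the cap is hit.
    while high <= max_concurrency and passes_sla(high):
        low, high = high, high * 2
    if high > max_concurrency:
        return low  # never failed within the cap
    # Binary search: `low` passed, `high` failed.
    while high - low > 1:
        mid = (low + high) // 2
        if passes_sla(mid):
            low = mid
        else:
            high = mid
    return low  # highest concurrency that met the SLA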

Common Engine Arguments

  • --dtype auto|float16|bfloat16
  • --tensor-parallel-size 1
  • --gpu-memory-utilization 0.9
  • --max-model-len 8192
  • --max-num-seqs 2048
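
For reference, a minimal sketch of how these flags map onto vLLM's AsyncEngineArgs, roughly what engine_manager.py is assumed to do (the exact wiring may differ):

# Sketch: build an AsyncLLMEngine directly from the common engine arguments
# and time the first streamed token of a single request.
import time
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3.1-8B",
    dtype="auto",                   # --dtype
    tensor_parallel_size=1,         # --tensor-parallel-size
    gpu_memory_utilization=0.9,     # --gpu-memory-utilization
    max_model_len=8192,             # --max-model-len
    max_num_seqs=2048,              # --max-num-seqs
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
sampling_params = SamplingParams(max_tokens=1, temperature=0.0)

async def time_to_first_token(prompt: str, request_id: str) -> float:
    start = time.perf_counter()
    # The first result yielded by the async generator marks the first token.
    async for _ in engine.generate(prompt, sampling_params, request_id):
        return time.perf_counter() - start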

GPU Selection

GPU selection is controlled via the CUDA_VISIBLE_DEVICES environment variable:

# Use first GPU
CUDA_VISIBLE_DEVICES=0 python3 main.py --model your-model

# Use second GPU
CUDA_VISIBLE_DEVICES=1 python3 main.py --model your-model

# Use multiple GPUs (for tensor parallelism)
CUDA_VISIBLE_DEVICES=0,1 python3 main.py --model your-model --tensor-parallel-size 2

Output

The benchmark outputs:

  • Real-time progress during ramp and search phases
  • Final result showing maximum sustainable concurrency
  • Detailed results saved to a JSON file (default: engine_conn_results.json)

Dependencies

  • vLLM
  • transformers (optional, for accurate token counting)
  • asyncio (built-in)
  • argparse (built-in)
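
Assuming a CUDA-capable environment, the external dependencies can typically be installed with pip:

pip install vllm transformers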

Copyright (C) 2025 Alexander Ng
