A benchmark to find the highest in-flight concurrency that a vLLM engine can sustain for long prompts while meeting SLA requirements.
The project has been refactored into multiple modules for better organization:
- main.py - Main entry point that orchestrates the benchmark
- config.py - Configuration management and command-line argument parsing
- prompt_builder.py - Utilities for constructing prompts of a target token length
- engine_manager.py - Engine initialization and sampling parameter creation
- benchmark.py - Core benchmarking functions (streaming, concurrency testing, ceiling finding)
- download_models.py - Utility to download models from Hugging Face to the default cache
- download_qwen_models.py - Batch download script for Qwen models
```bash
python3 main.py \
  --model meta-llama/Llama-3.1-8B \
  --target-input-tokens 5000 \
  --ttft-timeout 60 \
  --hold-seconds 5 \
  --sla-ok-rate 0.99
```

The project includes utilities to download models from Hugging Face to the default cache location:
```bash
# Download both Qwen3-30B-A3B and Qwen3-30B-A3B-FP8 models
make download-qwen-models

# Download a specific model
make download-model MODEL=Qwen/Qwen3-30B-A3B
```

Once downloaded, you can use the models with their original Hugging Face names:
```bash
# Use downloaded models with their original names
python main.py --model Qwen/Qwen3-30B-A3B
python main.py --model Qwen/Qwen3-30B-A3B-FP8
```

The models are stored in the default Hugging Face cache location and will be found automatically when referenced by their original model names.
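A minimal sketch of what the download step can look like, assuming huggingface_hub is installed; the actual download_models.py may be structured differently:

```python
# Sketch: pull a model into the default Hugging Face cache so vLLM can later
# resolve it by its original name. Assumes the huggingface_hub package.
from huggingface_hub import snapshot_download


def download(model_name: str) -> str:
    # With no cache_dir/local_dir override, files go to the default HF cache
    # (typically ~/.cache/huggingface/hub); the local path is returned.
    return snapshot_download(repo_id=model_name)


if __name__ == "__main__":
    print(download("Qwen/Qwen3-30B-A3B"))
```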
- Direct Engine Connection: Avoids HTTP/gRPC overhead by driving vLLM's AsyncLLMEngine directly (see the first sketch after this list)
- Concurrency Ceiling Detection: Uses an exponential ramp plus binary search to find the maximum sustainable concurrency (see the second sketch after this list)
- SLA Compliance: Ensures success-rate and time-to-first-token (TTFT) thresholds are met
- Flexible Prompt Building: Supports both tokenizer-based and character-based prompt construction
- GPU Selection: Choose which GPU device to use via the CUDA_VISIBLE_DEVICES environment variable
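The first sketch below shows the direct-engine pattern: a single request streamed through AsyncLLMEngine with its time-to-first-token recorded. It follows vLLM's documented AsyncLLMEngine interface but is not the project's benchmark.py, and exact import paths and generate() behavior can vary across vLLM versions.

```python
# Sketch: drive vLLM's AsyncLLMEngine directly (no HTTP/gRPC server) and
# record time-to-first-token (TTFT) for one streamed request.
# Assumption: vLLM's AsyncLLMEngine/AsyncEngineArgs API; details vary by version.
import asyncio
import time
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


async def stream_once(engine: AsyncLLMEngine, prompt: str, ttft_timeout: float) -> dict:
    """Stream one request and report TTFT plus SLA pass/fail."""
    params = SamplingParams(temperature=0.0, max_tokens=64)
    request_id = str(uuid.uuid4())
    start = time.perf_counter()
    ttft = None
    # generate() is an async generator yielding incremental RequestOutput objects.
    async for output in engine.generate(prompt, params, request_id):
        if ttft is None and output.outputs and output.outputs[0].token_ids:
            ttft = time.perf_counter() - start
    return {"ttft": ttft, "ok": ttft is not None and ttft <= ttft_timeout}


async def main() -> None:
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="meta-llama/Llama-3.1-8B", max_model_len=8192)
    )
    print(await stream_once(engine, "Hello " * 1000, ttft_timeout=60.0))


if __name__ == "__main__":
    asyncio.run(main())
```

In the benchmark, many such streams run concurrently and the per-request TTFT and success flags are aggregated against the --ttft-timeout and --sla-ok-rate thresholds.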
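The second sketch outlines the ceiling search. Here passes_sla is a hypothetical stand-in for one trial that holds a given concurrency for --hold-seconds and checks the TTFT and success-rate SLAs; the actual logic in benchmark.py may differ.

```python
# Sketch: find the concurrency ceiling with an exponential ramp followed by
# binary search. `passes_sla(c)` is a placeholder for running c concurrent
# streams and checking the SLA thresholds.
from typing import Callable


def find_ceiling(passes_sla: Callable[[int], bool], start: int = 1, cap: int = 4096) -> int:
    """Return the highest concurrency that still meets the SLA (0 if none)."""
    # Phase 1: exponential ramp - double concurrency until the SLA fails.
    last_ok, probe = 0, start
    while probe <= cap and passes_sla(probe):
        last_ok, probe = probe, probe * 2
    if last_ok == 0:
        return 0          # even the starting concurrency failed
    if probe > cap:
        return last_ok    # never failed below the cap

    # Phase 2: binary search between the last passing and first failing levels.
    lo, hi = last_ok, probe
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes_sla(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

Doubling first keeps the number of expensive trials logarithmic in the ceiling; the binary search then pins down the boundary between the last passing and first failing concurrency level.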
Engine configuration flags:

- --dtype auto|float16|bfloat16
- --tensor-parallel-size 1
- --gpu-memory-utilization 0.9
- --max-model-len 8192
- --max-num-seqs 2048
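For example, these flags can be combined with the benchmark options from the earlier invocation (values shown are illustrative):

```bash
python3 main.py \
  --model meta-llama/Llama-3.1-8B \
  --target-input-tokens 5000 \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --max-num-seqs 2048
```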
GPU selection is controlled via the CUDA_VISIBLE_DEVICES environment variable:
```bash
# Use first GPU
CUDA_VISIBLE_DEVICES=0 python3 main.py --model your-model

# Use second GPU
CUDA_VISIBLE_DEVICES=1 python3 main.py --model your-model

# Use multiple GPUs (for tensor parallelism)
CUDA_VISIBLE_DEVICES=0,1 python3 main.py --model your-model --tensor-parallel-size 2
```

The benchmark outputs:
- Real-time progress during ramp and search phases
- Final result showing maximum sustainable concurrency
- Detailed results saved to a JSON file (default: engine_conn_results.json)
- vLLM
- transformers (optional, for accurate token counting)
- asyncio (built-in)
- argparse (built-in)
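Assuming the standard PyPI package names, the external dependencies can be installed with:

```bash
pip install vllm transformers
```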
Copyright (C) 2025 Alexander Ng