
@BenjaminBruenau commented Oct 31, 2025

Pull Request

Description

Currently, HTTP requests to the vLLM server are sent out sequentially instead of in batches.
See #67 and #68 for reference.

I initially encountered this during the synthetic data & agents hackathon and solved it with a thread pool via concurrent.futures.
But since we are not actually doing any work (on the synthetic-data-kit side) in these threads, the more lightweight asyncio approach from @HarshVaragiya is much better: we are essentially just spawning HTTP requests and "waiting in parallel".
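For illustration, a minimal sketch of that asyncio fan-out pattern, assuming aiohttp and a local OpenAI-compatible /v1/chat/completions endpoint; this shows the general idea, not the exact code from #68:

```python
import asyncio

import aiohttp


async def send_request(session: aiohttp.ClientSession, url: str, payload: dict) -> dict:
    # A single HTTP request; the coroutine yields while waiting on the server,
    # so many of these can be "waiting in parallel" on one thread.
    async with session.post(url, json=payload) as resp:
        resp.raise_for_status()
        return await resp.json()


async def send_batch(url: str, payloads: list[dict]) -> list[dict]:
    # Fan out the whole batch at once instead of awaiting each request in turn.
    async with aiohttp.ClientSession() as session:
        tasks = [send_request(session, url, p) for p in payloads]
        return await asyncio.gather(*tasks)


# Hypothetical usage against a local vllm serve instance:
# results = asyncio.run(send_batch("http://localhost:8000/v1/chat/completions", payloads))
```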

Benchmarks

OpenAI Batching (Provider=api-endpoint) OLD:
Single request latency (overall): count=10 min=18.235s median=35.391s mean=35.944s max=55.202s
Single request latency (short prompt): count=5 min=18.235s median=25.101s mean=24.447s max=29.872s
Single request latency (long prompt): count=5 min=40.909s median=47.113s mean=47.440s max=55.202s
Batch latency (overall, batch_size=8): count=10 min=3.706s median=5.348s mean=5.503s max=7.385s
Batch latency (short-only, batch_size=8): count=5 min=3.706s median=4.149s mean=4.667s max=6.908s
Batch latency (1 long prompt(s), batch_size=8): count=5 min=4.671s median=6.634s mean=6.339s max=7.385s
-> 13 min 40s

VLLM "Batching" (Provider=vllm) OLD:
Single request latency (overall): count=10 min=15.975s median=30.439s mean=31.472s max=50.811s
Single request latency (short prompt): count=5 min=15.975s median=18.694s mean=20.331s max=31.216s
Single request latency (long prompt): count=5 min=29.662s median=49.092s mean=42.613s max=50.811s
Batch latency (overall, batch_size=8): count=10 min=18.593s median=22.035s mean=21.966s max=25.405s
Batch latency (short-only, batch_size=8): count=5 min=18.593s median=21.603s mean=20.949s max=23.406s
Batch latency (1 long prompt(s), batch_size=8): count=5 min=21.281s median=22.955s mean=22.984s max=25.405s
-> 34 min 50s
-> essentially as slow as processing sequentially (because it is)

VLLM Batching (Provider=vllm) NEW (HarshVaragiya) #68:
Single request latency (overall): count=10 min=13.990s median=31.602s mean=29.933s max=47.919s
Single request latency (short prompt): count=5 min=13.990s median=19.956s mean=21.075s max=31.802s
Single request latency (long prompt): count=5 min=31.403s median=38.369s mean=38.791s max=47.919s
Batch latency (overall, batch_size=8): count=10 min=3.025s median=4.228s mean=4.842s max=6.649s
Batch latency (short-only, batch_size=8): count=5 min=3.025s median=3.740s mean=3.635s max=4.367s
Batch latency (1 long prompt(s), batch_size=8): count=5 min=4.090s median=6.498s mean=6.050s max=6.649s
-> 11 min 56s
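For reference, latencies like the ones above can be summarized with a small helper along these lines (a hypothetical sketch, not the actual benchmark script used here):

```python
import statistics
import time


def time_call(fn, *args, **kwargs) -> float:
    # Wall-clock latency of a single (batched or unbatched) generation call.
    start = time.perf_counter()
    fn(*args, **kwargs)
    return time.perf_counter() - start


def summarize(name: str, latencies: list[float]) -> str:
    # Collapse per-call latencies into the count/min/median/mean/max format above.
    return (
        f"{name}: count={len(latencies)} min={min(latencies):.3f}s "
        f"median={statistics.median(latencies):.3f}s "
        f"mean={statistics.mean(latencies):.3f}s max={max(latencies):.3f}s"
    )
```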

However, since vllm serve essentially exposes an OpenAI-compatible API, it makes sense to merge the vllm
and api-endpoint providers into a single one. That way the logic can be centralized and synthetic-data-kit can be easily extended with new providers.
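As a rough sketch of what that unification can look like: assuming the official openai Python client, a local vllm serve instance and a hosted endpoint differ only in base_url and credentials, so a single provider implementation can cover both (the helper below is illustrative, not the actual synthetic-data-kit API):

```python
from openai import OpenAI


def make_client(api_base: str, api_key: str = "not-needed") -> OpenAI:
    # vllm serve, llama.cpp's llama-server and hosted endpoints all speak the
    # same OpenAI-compatible protocol, so one client type covers every provider.
    return OpenAI(base_url=api_base, api_key=api_key)


# Local vLLM:      make_client("http://localhost:8000/v1")
# Hosted endpoint: make_client("https://api.example.com/v1", api_key="sk-...")
```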

Fixes #67

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (somewhat breaking since the config is updated, but it is kept backwards compatible with the api-endpoint and vllm providers for now)
  • Documentation update

HarshVaragiya and others added 4 commits September 5, 2025 21:31
…provider

- Fix VLLM provider processing batches sequentially (based on Harsh Varagiya's fix for the sequential vLLM issue)
- Single unified provider for accessing local inference engines via their APIs (e.g. vllm serve or llama.cpp's llama-server)
- Add backwards compatibility with old configs using the api-endpoint and vllm providers (see the sketch after this list)
- Remove "event loop is closed" warnings when processing batch requests
- Add error handling for invalid Lance dataset paths
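A minimal sketch of how that backwards compatibility could be handled, mapping the legacy provider names onto the unified one (the function and the "unified-api" name are hypothetical, not the actual config schema):

```python
# Legacy provider names that should now resolve to the unified API provider.
LEGACY_PROVIDERS = {"vllm", "api-endpoint"}


def resolve_provider(config: dict) -> dict:
    # Old configs keep working: "vllm" and "api-endpoint" are mapped onto the
    # unified provider; new configs can name it directly.
    provider = config.get("provider", "api-endpoint")
    if provider in LEGACY_PROVIDERS:
        config = {**config, "provider": "unified-api"}
    return config
```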