Unify vllm and api-endpoint providers into a single provider for OpenAI-compatible APIs
#81
Description
Currently, HTTP requests to the vllm server are sent out sequentially instead of in batches.
See #67 and #68 for reference.
I initially ran into this during the synthetic data & agents hackathon and solved it with a thread pool via concurrent.futures. But since we are not actually doing any work (on the synthetic-data-kit side) in these threads, the more lightweight asyncio approach from @HarshVaragiya is much better: we are essentially just spawning HTTP requests and "waiting in parallel".
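For illustration only, a minimal sketch of the asyncio idea (not the actual implementation from #68; the endpoint URL, API key, and model name are placeholders):

```python
import asyncio
from openai import AsyncOpenAI

# Placeholder endpoint: vLLM's OpenAI-compatible server is assumed to listen on /v1.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def complete(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def complete_batch(prompts: list[str]) -> list[str]:
    # All requests are in flight at the same time; we only "wait in parallel".
    return await asyncio.gather(*(complete(p) for p in prompts))

if __name__ == "__main__":
    print(asyncio.run(complete_batch(["What is vLLM?", "What is asyncio?"])))
```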
Benchmarks
OpenAI Batching (Provider=api-endpoint) OLD:
Single request latency (overall): count=10 min=18.235s median=35.391s mean=35.944s max=55.202s
Single request latency (short prompt): count=5 min=18.235s median=25.101s mean=24.447s max=29.872s
Single request latency (long prompt): count=5 min=40.909s median=47.113s mean=47.440s max=55.202s
Batch latency (overall, batch_size=8): count=10 min=3.706s median=5.348s mean=5.503s max=7.385s
Batch latency (short-only, batch_size=8): count=5 min=3.706s median=4.149s mean=4.667s max=6.908s
Batch latency (1 long prompt(s), batch_size=8): count=5 min=4.671s median=6.634s mean=6.339s max=7.385s
-> 13 min 40s
VLLM "Batching" (Provider=vllm) OLD:
Single request latency (overall): count=10 min=15.975s median=30.439s mean=31.472s max=50.811s
Single request latency (short prompt): count=5 min=15.975s median=18.694s mean=20.331s max=31.216s
Single request latency (long prompt): count=5 min=29.662s median=49.092s mean=42.613s max=50.811s
Batch latency (overall, batch_size=8): count=10 min=18.593s median=22.035s mean=21.966s max=25.405s
Batch latency (short-only, batch_size=8): count=5 min=18.593s median=21.603s mean=20.949s max=23.406s
Batch latency (1 long prompt(s), batch_size=8): count=5 min=21.281s median=22.955s mean=22.984s max=25.405s
-> 34 min 50s
-> essentially as slow as processing sequentially (because it is)
VLLM Batching (Provider=vllm) NEW (HarshVaragiya) #68:
Single request latency (overall): count=10 min=13.990s median=31.602s mean=29.933s max=47.919s
Single request latency (short prompt): count=5 min=13.990s median=19.956s mean=21.075s max=31.802s
Single request latency (long prompt): count=5 min=31.403s median=38.369s mean=38.791s max=47.919s
Batch latency (overall, batch_size=8): count=10 min=3.025s median=4.228s mean=4.842s max=6.649s
Batch latency (short-only, batch_size=8): count=5 min=3.025s median=3.740s mean=3.635s max=4.367s
Batch latency (1 long prompt(s), batch_size=8): count=5 min=4.090s median=6.498s mean=6.050s max=6.649s
-> 11 min 56s
However, since vllm serve essentially exposes an OpenAI-compatible API, it makes sense to merge the vllm and api-endpoint providers into a single one. That way the logic is centralized and synthetic-data-kit can be easily extended with new providers.
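A hypothetical sketch of what such a unified provider could look like (class, parameter, and model names are made up for illustration and are not the actual synthetic-data-kit API):

```python
from openai import AsyncOpenAI

class OpenAICompatibleProvider:
    """One provider for any OpenAI-compatible API (vLLM, hosted endpoints, ...)."""

    def __init__(self, base_url: str, api_key: str, model: str):
        self.client = AsyncOpenAI(base_url=base_url, api_key=api_key)
        self.model = model

    async def chat(self, messages: list[dict]) -> str:
        resp = await self.client.chat.completions.create(
            model=self.model, messages=messages
        )
        return resp.choices[0].message.content

# The same class would back both former providers; only the configuration differs.
vllm_provider = OpenAICompatibleProvider(
    base_url="http://localhost:8000/v1", api_key="EMPTY",
    model="meta-llama/Llama-3.1-8B-Instruct",
)
hosted_provider = OpenAICompatibleProvider(
    base_url="https://example.com/v1", api_key="<API_KEY>",
    model="some-hosted-model",
)
```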
Fixes #67
Type of change
Please delete non-relevant options.