[VLM] Fixing request timeout error, and enabling VllmDeployer to fail fast if the underlying vllm serve process has already failed
#2409
base: master
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
johncalesp left a comment:
Looks good, thank you!
Victor49152 left a comment:
Verified for a few runs and it works well.
This PR is ready. @mrmhodak @hanyunfan Could you help to approve and merge this PR?
@nvzhihanj do you wanna take another look? If it LGTY, we can merge it.
This PR fixes two issues:
1. When using `DefaultAioHttpClient` in `AsyncOpenAI`, it turns out that `DefaultAioHttpClient`'s HTTP request timeout setting would override `AsyncOpenAI`'s chat completion request timeout setting; therefore, I also need to set the timeout in `DefaultAioHttpClient` as well.
2. With `VllmDeployer`, it would still wait 20 mins (or whatever the timeout is) to fail. Now it can fail fast as soon as the underlying `vllm serve` process fails.

In addition, this PR also:
3. Introduces example Slurm scripts that demonstrate how to run the benchmark in a GPU cluster managed by Slurm.
4. Changes the default `server_expected_qps` and `server_target_latency` to the values that we are proposing:
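The fail-fast behavior in item 2 can be sketched roughly as follows. This is a minimal stdlib sketch of the pattern, not the actual `VllmDeployer` code; the names `wait_for_server`, `is_ready`, and the timeout values are illustrative:

```python
import subprocess
import time


def wait_for_server(proc: subprocess.Popen, is_ready, timeout_s: float,
                    poll_interval_s: float = 0.5) -> None:
    """Block until is_ready() reports the server is up, but fail fast if
    the child process has already exited instead of waiting out the full
    timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        rc = proc.poll()
        if rc is not None:
            # The server process died; no point waiting out the timeout.
            raise RuntimeError(f"server process exited early with code {rc}")
        if is_ready():
            return
        time.sleep(poll_interval_s)
    raise TimeoutError(f"server not ready within {timeout_s}s")
```

Here `is_ready()` stands in for whatever readiness probe the deployer uses (e.g. polling the server's health endpoint); the key point is checking `proc.poll()` on every iteration, so a crashed `vllm serve` surfaces immediately rather than after the full 20-minute wait.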