
Support benchmarking of reasoning and tool calling Chat Completions requests #416

@bbrowning

Description

Is your feature request related to a problem? Please describe.
I need to benchmark the performance of reasoning and tool call parsers in vLLM to understand their overall impact on time to first token, inter-token latency, inter-chunk latency (that is, the time between streaming chunks sent back to the client), and overall throughput of the vLLM server(s), so I can iterate on and improve the implementations of these parsers in vLLM, guided by quantitative performance numbers.

Describe the solution you'd like

I'd like us to ensure we can take inputs that express complex Chat Completions requests, which may include fields such as `tools` (potentially carrying a large number of function definitions) as well as fields like `tool_choice` to control auto/required/none/named-function tool calling. `tools` may need to become a new top-level input field we support, while something like `tool_choice` may be fine to pass in via `extra_body`, since it's usually a simple string value. An example in our docs of how to do basic tool call benchmarking would help guide users in setting things up properly.
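For illustration only, here's a rough sketch of the kind of streaming Chat Completions request such a benchmark would need to issue. This is plain `openai` client code, not an existing GuideLLM input format; the base URL, model, prompt, and tool definition are placeholders borrowed from the example chunks under "Additional context" below.

```python
# Sketch only: shows the request shape, not a GuideLLM feature.
# Assumes a vLLM server at base_url serving the Granite model from the example below.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }
]

stream = client.chat.completions.create(
    model="ibm-granite/granite-3.3-8b-instruct",
    messages=[{"role": "user", "content": "What's the weather in Boston?"}],
    tools=tools,               # potentially a new top-level input field
    tool_choice="auto",        # simple string; could come in via extra_body
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    # Each chunk mirrors the JSON objects shown under "Additional context" below.
    print(chunk.model_dump_json())
```

Whether `tools` comes from a dataset column, a config file, or a CLI flag is an open design question; the sketch only shows the payload shape the benchmark needs to be able to produce.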

I'd also like us to validate that the metrics we emit today are accurate and not misleading when a reasoning parser or tool call parser is involved on the vLLM server side.

For example, a tool call parser may buffer dozens of tokens server-side while waiting to determine whether the response is generating a tool call or not. When does the clock start for time-to-first-token? When does it stop? The same questions apply to inter-token latency. The current OpenAI Chat Completions implementation has a bug today in how it calculates ITL when tool call parsers are in use. We'll need to audit all of our metrics carefully to ensure they don't make assumptions that are no longer valid in a world where the server is buffering, parsing, and changing what the client sees versus what the model generated.

What should we report when inter-token latency is, say, 20 ms, but inter-chunk latency (i.e., the time between streaming chunks coming back to the client) is multiple seconds or even minutes? Do we need to start tracking a new metric that represents the time between chunks streamed back to the client, distinct and separate from ITL? How would we optimize tool call parsers to lower inter-chunk latency objectively, backed by data?
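As one hypothetical way to tease these apart client-side, here's a sketch (not existing GuideLLM code) that records a timestamp and the reported `usage.completion_tokens` for every chunk, then derives time-to-first-chunk, inter-chunk latency, and an ITL estimate from the usage deltas. It assumes the server reports usage on every chunk, as vLLM does in the example chunks below.

```python
# Sketch only: one possible way to separate inter-chunk latency from ITL.
import time
from dataclasses import dataclass, field


@dataclass
class StreamTimings:
    """Per-request timing record for one streamed Chat Completions response."""

    request_start: float
    chunk_times: list[float] = field(default_factory=list)
    completion_tokens: list[int] = field(default_factory=list)

    def record(self, chunk) -> None:
        # Call once per streamed chunk; assumes every chunk carries usage.
        self.chunk_times.append(time.perf_counter())
        self.completion_tokens.append(chunk.usage.completion_tokens)

    @property
    def time_to_first_chunk(self) -> float:
        # The clock stops at the first chunk the *client* sees, which may
        # already cover many tokens buffered by the tool call parser.
        return self.chunk_times[0] - self.request_start

    @property
    def inter_chunk_latencies(self) -> list[float]:
        return [b - a for a, b in zip(self.chunk_times, self.chunk_times[1:])]

    @property
    def itl_estimates(self) -> list[float]:
        # Spread each inter-chunk gap over the number of tokens the usage
        # counter advanced, so one chunk covering many generated tokens does
        # not get counted as a single "token" interval.
        deltas = [b - a for a, b in zip(self.completion_tokens, self.completion_tokens[1:])]
        return [gap / d for gap, d in zip(self.inter_chunk_latencies, deltas) if d > 0]
```

A "time between client-visible chunks" metric derived this way could then be reported alongside ITL rather than conflated with it; whether the ITL estimate should instead come from server-side timing is exactly the kind of question the metrics audit would need to settle.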

Describe alternatives you've considered
I have not run across another project so far that can accurately measure inter-token latency when tool call parsing is in use. I've only really dug into the code of `vllm bench serve`, llm-load-test, and guidellm, but I am open to considering other projects if they're known to work for more advanced reasoning and tool calling cases.

I've also considered creating my own project, as GuideLLM today appears more focused on raw prompts/responses via the Completions API than on newer APIs like Chat Completions or Responses. However, there are recent changes to make GuideLLM more pluggable, so I decided to start here first.

Additional context

Here's a simple example of the kinds of streaming chunks we get back with a tool call parser enabled, with most of the chunks in the middle of the stream cut out for brevity. Note that the first chunk with non-empty content that the client sees arrives only after the model has generated 17 completion tokens. This is a basic example from a Granite model; the number can be far higher for models that reason and/or follow structured formats like the gpt-oss Harmony format, where it can easily balloon to many dozens of tokens generated before the first chunk is sent back to the client.

{"id":"chatcmpl-benchmark-serving0","object":"chat.completion.chunk","created":1760715994,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":371,"total_tokens":371,"completion_tokens":0}}

{"id":"chatcmpl-benchmark-serving0","object":"chat.completion.chunk","created":1760715994,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"tool_calls":[{"id":"chatcmpl-tool-1417795059744796902a5cf411ab9ab6","type":"function","index":0,"function":{"name":"get_current_weather","arguments":"{"}}]},"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":371,"total_tokens":388,"completion_tokens":17}}

{"id":"chatcmpl-benchmark-serving0","object":"chat.completion.chunk","created":1760715994,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"name":null,"arguments":"\n "}}]},"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":371,"total_tokens":389,"completion_tokens":18}}

...

{"id":"chatcmpl-benchmark-serving0","object":"chat.completion.chunk","created":1760715994,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"name":null,"arguments":"\n"}}]},"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":371,"total_tokens":409,"completion_tokens":38}}

{"id":"chatcmpl-benchmark-serving0","object":"chat.completion.chunk","created":1760715994,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[],"usage":{"prompt_tokens":371,"total_tokens":411,"completion_tokens":40}}

Labels: enhancement (New feature or request), internal (filed by core contributor or associate)