
Support benchmarking of reasoning and tool calling Chat Completions requests #416

@bbrowning

Description

Is your feature request related to a problem? Please describe.
I need to benchmark the performance of reasoning and tool call parsers in vLLM to understand their overall impact on time to first token, inter-token latency, inter-chunk latency (that is, the time between streaming chunks sent back to the client), and overall throughput of the vLLM server(s), so I can iterate on and improve the implementations of these parsers in vLLM, guided by quantitative performance numbers.

Describe the solution you'd like

I'd like us to ensure we can take inputs that express complex Chat Completions requests, which may include fields such as `tools` (potentially carrying a large number of function definitions) as well as fields like `tool_choice` to control auto/required/none/named-function tool calling. `tools` may need to become a new top-level input field we support, while something like `tool_choice` may be fine to pass in via `extra_body`, since it's usually a simple string value. An example in our docs of how to do basic tool call benchmarking would help guide users in setting things up properly.
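For illustration only, here's a rough sketch of the kind of streaming Chat Completions request such a benchmark would need to issue. This is plain `openai` client code, not an existing GuideLLM input format; the base URL, model, prompt, and tool definition are placeholders borrowed from the example chunks under "Additional context" below.

```python
# Sketch only: shows the request shape, not a GuideLLM feature.
# Assumes a vLLM server at base_url serving the Granite model from the example below.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }
]

stream = client.chat.completions.create(
    model="ibm-granite/granite-3.3-8b-instruct",
    messages=[{"role": "user", "content": "What's the weather in Boston?"}],
    tools=tools,               # potentially a new top-level input field
    tool_choice="auto",        # simple string; could come in via extra_body
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    # Each chunk mirrors the JSON objects shown under "Additional context" below.
    print(chunk.model_dump_json())
```

Whether `tools` comes from a dataset column, a config file, or a CLI flag is an open design question; the sketch only shows the payload shape the benchmark needs to be able to produce.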

I'd also like us to validate that the metrics we emit today are accurate and not misleading when a reasoning parser or tool call parser is involved on the vLLM server side.

For example, a tool call parser may buffer dozens of tokens server-side while waiting to determine whether the response is generating a tool call or not. When does the clock start for time-to-first-token? When does it stop? The same questions apply to inter-token latency. The current OpenAI Chat Completions implementation has a bug today in how it calculates ITL when tool call parsers are in use. We'll need to audit all of our metrics carefully to ensure they don't make assumptions that are no longer valid in a world where the server is buffering, parsing, and changing what the client sees versus what the model generated.

What should we report when inter-token latency is, say, 20 ms, but inter-chunk latency (i.e., the time between streaming chunks coming back to the client) is multiple seconds or even minutes? Do we need to start tracking a new metric that represents the time between chunks streamed back to the client, distinct and separate from ITL? How would we optimize tool call parsers to lower inter-chunk latency objectively, backed by data?
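As one hypothetical way to tease these apart client-side, here's a sketch (not existing GuideLLM code) that records a timestamp and the reported `usage.completion_tokens` for every chunk, then derives time-to-first-chunk, inter-chunk latency, and an ITL estimate from the usage deltas. It assumes the server reports usage on every chunk, as vLLM does in the example chunks below.

```python
# Sketch only: one possible way to separate inter-chunk latency from ITL.
import time
from dataclasses import dataclass, field


@dataclass
class StreamTimings:
    """Per-request timing record for one streamed Chat Completions response."""

    request_start: float
    chunk_times: list[float] = field(default_factory=list)
    completion_tokens: list[int] = field(default_factory=list)

    def record(self, chunk) -> None:
        # Call once per streamed chunk; assumes every chunk carries usage.
        self.chunk_times.append(time.perf_counter())
        self.completion_tokens.append(chunk.usage.completion_tokens)

    @property
    def time_to_first_chunk(self) -> float:
        # The clock stops at the first chunk the *client* sees, which may
        # already cover many tokens buffered by the tool call parser.
        return self.chunk_times[0] - self.request_start

    @property
    def inter_chunk_latencies(self) -> list[float]:
        return [b - a for a, b in zip(self.chunk_times, self.chunk_times[1:])]

    @property
    def itl_estimates(self) -> list[float]:
        # Spread each inter-chunk gap over the number of tokens the usage
        # counter advanced, so one chunk covering many generated tokens does
        # not get counted as a single "token" interval.
        deltas = [b - a for a, b in zip(self.completion_tokens, self.completion_tokens[1:])]
        return [gap / d for gap, d in zip(self.inter_chunk_latencies, deltas) if d > 0]
```

A "time between client-visible chunks" metric derived this way could then be reported alongside ITL rather than conflated with it; whether the ITL estimate should instead come from server-side timing is exactly the kind of question the metrics audit would need to settle.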

Describe alternatives you've considered
I have not run across another project so far that can accurately measure inter-token latency when tool call parsing is in use. I've only really dug into the code of `vllm bench serve`, llm-load-test, and guidellm, but I am open to considering other projects if they're known to work for more advanced reasoning and tool calling cases.

I've also considered creating my own project, as GuideLLM today appears more focused on raw prompts/responses via the Completions API than on newer APIs like Chat Completions or Responses. However, there are recent changes to make GuideLLM more pluggable, so I decided to start here first.

Additional context

Here's a simple example of the kinds of streaming chunks we get back with a tool call parser enabled, with most of the chunks in the middle of the stream cut out for brevity. Note that the first chunk with non-empty content that the client sees arrives only after the model has generated 17 completion tokens. This is a basic example from a Granite model; the number can be far higher for models that reason and/or follow structured formats like the gpt-oss Harmony format, where it can easily balloon to many dozens of tokens generated before the first chunk is sent back to the client.

{"id":"chatcmpl-benchmark-serving0","object":"chat.completion.chunk","created":1760715994,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":371,"total_tokens":371,"completion_tokens":0}}

{"id":"chatcmpl-benchmark-serving0","object":"chat.completion.chunk","created":1760715994,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"tool_calls":[{"id":"chatcmpl-tool-1417795059744796902a5cf411ab9ab6","type":"function","index":0,"function":{"name":"get_current_weather","arguments":"{"}}]},"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":371,"total_tokens":388,"completion_tokens":17}}

{"id":"chatcmpl-benchmark-serving0","object":"chat.completion.chunk","created":1760715994,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"name":null,"arguments":"\n "}}]},"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":371,"total_tokens":389,"completion_tokens":18}}

...

{"id":"chatcmpl-benchmark-serving0","object":"chat.completion.chunk","created":1760715994,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"name":null,"arguments":"\n"}}]},"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":371,"total_tokens":409,"completion_tokens":38}}

{"id":"chatcmpl-benchmark-serving0","object":"chat.completion.chunk","created":1760715994,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[],"usage":{"prompt_tokens":371,"total_tokens":411,"completion_tokens":40}}

Labels: enhancement (New feature or request), internal (filed by core contributor or associate)