fix: resilient DSML tool-call parsing and repair in long contexts#192
Open
jackygurui wants to merge 4 commits into
Open
fix: resilient DSML tool-call parsing and repair in long contexts#192jackygurui wants to merge 4 commits into
jackygurui wants to merge 4 commits into
Conversation
During long tool-call generations (2000+ tokens), the model's attention degrades and drops closing DSML tags before reaching max_tokens. This causes finish=error with 'unterminated tool call', aborting the turn. Fix: before returning error, attempt to repair by appending missing closing tags (parameter -> invoke -> tool_calls in nesting order), then re-parse to verify the repair produces valid tool calls. - Add try_repair_dsml() to detect and fix unclosed DSML blocks - Integrate repair at the unterminated tool call error path - Add test_dsml_repair_produces_parseable_calls() with 7 scenarios covering all three DSML styles and multiple truncation patterns - Tests verify structural accuracy: tool name and arguments are correct Results: 0 finish=error across 156+ requests, 100% repair success rate on unterminated tool calls.
Long-context generations produce malformed DSML that parse_generated_message cannot parse, causing "invalid tool call" and breaking the agent loop. Three failure modes observed in stress testing (256K, q4-imatrix): Mode 1 (unterminated): model stops mid-DSML, missing closing tags Mode 2 (malformed closed): outer tags balanced but inner tags broken Mode 3 (hallucinated): tool_calls tags wrap plain reasoning text This commit addresses modes 1 and 2 via try_repair_dsml(): single-pass tag counting (O(n)) followed by appending missing closing tags in reverse nesting order (parameter -> invoke -> tool_calls). Also adds unit tests. Mode 3 is handled by antirez's commit 037ee39 which prevents DSML inside thinking from being detected as executable tool calls. Also adds orphan end tag guard: when toe>tos or ioe>ios or poe>pos, the size_t subtraction would underflow. Return false early. Signed-off-by: Rui Gu <jackygurui@gmail.com>
When parse_generated_message_ex is called with require_thinking_closed=true and the model never outputs </thinking>, the entire generation is treated as reasoning and any DSML inside is silently ignored. This stderr log makes the gate visible for debugging. Refs: antirez#167, commit 037ee39 (Ignore tool calls emitted inside thinking) Signed-off-by: Rui Gu <jackygurui@gmail.com>
try_repair_dsml scanned the full generated text for DSML tags. When the model discusses DSML syntax in its reasoning (e.g. explaining the DSML tags), those text mentions inflate the tag counts, causing false positive repairs (appending unnecessary closing tags). Fix: find the last </thinking> boundary and start counting only from there. DSML mentioned inside reasoning is model text, not executable tool calls — matches the same approach used by parse_generated_message_ex (commit 037ee39). Also updated the hallucinated strip path to copy the thinking section verbatim and only strip from the post-thinking region. Real-world validation: observed this exact false positive in production. The model was explaining how try_repair_dsml works and quoted the DSML tag syntax in its explanation. The parser mistook the quote for a real tool call and the tag counting inflated, causing a failed repair. Metrics from production use (after all fixes in this branch): Tool calls: 169 | Invalid: 2 | Repaired: 38 | Orphan: 3 Only 2 cases remain unrecoverable; the other 41 (38 repaired + 3 orphan) are now gracefully handled instead of causing finish=error. Signed-off-by: Rui Gu <jackygurui@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Long-context and long-generation scenarios expose several failure modes in DSML tool-call parsing. The model's attention degrades over thousands of tokens, producing malformed DSML that causes
finish=errorand breaks the agent loop. This PR adds a robust repair layer (try_repair_dsml) that handles all observed failure modes, plus targeted fixes for false positives and visibility improvements.Problem
Three failure modes were observed in stress testing (256K context, q4-imatrix):
finish=error, "unterminated tool call"finish=error, "invalid tool call"<tool_calls>...</tool_calls>wrapping plain reasoning textBefore this fix, all three modes would land in
finish=error, silently aborting the agent loop. Additionally, DSML tags mentioned inside<thinking>text (e.g. the model explaining tool syntax in its reasoning) were being counted by the tag scanner intry_repair_dsml, causing false positive repairs.Changes
try_repair_dsml()— single-pass DSML repair (2 commits)parameter → invoke → tool_calls)tool_callstags exist but contain no<invoke>, strips the orphaned tags so the text is treated as plain contentsize_tunderflow)Ignore DSML inside
<thinking>(1 commit)try_repair_dsmlnow finds the last</thinking>boundary and scans for DSML tags only from there — deliberately matching the same strategy used byparse_generated_message_ex, which was introduced in commit037ee39("Ignore tool calls emitted inside thinking") to fix the same class of false positives reported in #167Debug visibility (1 commit)
stderrmessage whenrequire_thinking_closed=trueis triggered but the model never closes<thinking>Results
Before this fix, the 38 repaired + 3 orphan + 2 invalid cases would all have been counted as invalid, with every occurrence causing a hard
finish=error. After the fix:Only 2 cases remain unrecoverable; the other 41 are gracefully handled (38 repaired, 3 hallucinated-tags stripped). 100% repair success rate on unterminated tool calls.