-
-
Notifications
You must be signed in to change notification settings - Fork 23.4k
fix: allow Unicode characters in document store tool names #5599
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: allow Unicode characters in document store tool names #5599
Conversation
- Add sanitizeToolName helper function that preserves Unicode letters (Korean, Chinese, Japanese, etc.)
- Use Unicode-aware regex pattern [^\p{L}\p{N}_-] instead of [^a-z0-9_-]
- Add fallback name generation when result is empty
- Fixes 'Invalid tools[0].function.name: empty string' error for non-ASCII titles
Summary of ChangesHello @doubleclip118, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves an issue where tool names containing non-ASCII characters would lead to validation errors. It introduces a robust sanitization utility that supports Unicode characters and ensures a valid tool name is always generated, significantly enhancing the internationalization capabilities and stability of tool name handling within the system. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a sanitizeToolName helper function to allow Unicode characters in tool names, addressing an issue where non-ASCII titles caused errors. The new function correctly uses a Unicode-aware regex and provides a fallback for empty names. The implementation is a significant improvement, centralizing the sanitization logic. My review includes a high-severity suggestion to prevent potential tool name collisions in the fallback mechanism and a medium-severity suggestion to simplify a conditional check for better code clarity.
|
|
||
| // If the result is empty, generate a fallback name | ||
| if (!sanitized || sanitized.trim() === '') { | ||
| return `tool_${Date.now()}` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fallback name generation using Date.now() is not guaranteed to be unique. If this function is called multiple times within the same millisecond for different tools whose names both sanitize to an empty string, it will produce duplicate tool names. This can lead to unpredictable behavior, as the agent might use the wrong tool. To ensure uniqueness, consider appending a random component to the fallback name.
| return `tool_${Date.now()}` | |
| return `tool_${Date.now()}_${Math.random().toString(36).slice(2, 7)}` |
| .replace(/[^\p{L}\p{N}_-]/gu, '') // Allow Unicode letters and numbers | ||
|
|
||
| // If the result is empty, generate a fallback name | ||
| if (!sanitized || sanitized.trim() === '') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition !sanitized || sanitized.trim() === '' can be simplified. Given the preceding replacement operations (.replace(/ /g, '_') and .replace(/[^\\p{L}\\p{N}_-]/gu, '')), the sanitized string will not contain any leading or trailing whitespace. If the string becomes empty, !sanitized will be true. Therefore, the sanitized.trim() === '' check is redundant and can be removed for clarity.
| if (!sanitized || sanitized.trim() === '') { | |
| if (!sanitized) { |
- Simplify empty check by removing unnecessary trim() call - Add random suffix to fallback name for uniqueness guarantee - Prevents potential duplicate tool names in same millisecond
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a sanitizeToolName helper function to handle Unicode characters in tool names, preventing errors when non-ASCII titles are used. The new function replaces duplicated sanitization logic throughout the Agent.ts file.
My review focuses on the implementation of sanitizeToolName. While the intent to support Unicode is good, the current implementation may cause issues with LLM APIs that have strict naming conventions for tools. I've provided suggestions to make the sanitization more robust and compatible with major models like OpenAI, Anthropic, and Gemini. I've also included recommendations for enforcing a character limit and improving the generation of unique fallback names.
| const sanitizeToolName = (name: string): string => { | ||
| const sanitized = name | ||
| .toLowerCase() | ||
| .replace(/ /g, '_') | ||
| .replace(/[^\p{L}\p{N}_-]/gu, '') // Allow Unicode letters and numbers | ||
|
|
||
| // If the result is empty, generate a fallback name with random suffix for uniqueness | ||
| if (!sanitized) { | ||
| return `tool_${Date.now()}_${Math.random().toString(36).slice(2, 7)}` | ||
| } | ||
|
|
||
| return sanitized | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While allowing Unicode characters in tool names is a good intention for internationalization, it might cause issues with the underlying LLM APIs. Most major models have strict restrictions on tool names:
- OpenAI & Anthropic:
^[a-zA-Z0-9_-]{1,64}$ - Google Gemini:
[a-zA-Z0-9_]+(max 63 chars, no hyphens)
The current regex /[^\p{L}\p{N}_-]/gu is too permissive and will allow characters that these models reject, potentially causing runtime errors.
A safer approach is to use a more restrictive regex that only allows characters supported by most models. The fallback mechanism you've added will correctly handle names that become empty after sanitization (like names composed entirely of non-ASCII characters).
Additionally, it's good practice to enforce a maximum length for tool names (e.g., 64 characters). I've included this in the suggestion below.
| const sanitizeToolName = (name: string): string => { | |
| const sanitized = name | |
| .toLowerCase() | |
| .replace(/ /g, '_') | |
| .replace(/[^\p{L}\p{N}_-]/gu, '') // Allow Unicode letters and numbers | |
| // If the result is empty, generate a fallback name with random suffix for uniqueness | |
| if (!sanitized) { | |
| return `tool_${Date.now()}_${Math.random().toString(36).slice(2, 7)}` | |
| } | |
| return sanitized | |
| } | |
| const sanitizeToolName = (name: string): string => { | |
| const sanitized = name | |
| .toLowerCase() | |
| .replace(/ /g, '_') | |
| .replace(/[^a-z0-9_-]/g, ''); // Restrict to characters allowed by major LLMs | |
| // If the result is empty, generate a fallback name with random suffix for uniqueness | |
| if (!sanitized) { | |
| return `tool_${Date.now()}_${Math.random().toString(36).slice(2, 7)}`; | |
| } | |
| // Enforce a 64-character limit, which is common for tool names | |
| return sanitized.slice(0, 64); | |
| } |
|
|
||
| // If the result is empty, generate a fallback name with random suffix for uniqueness | ||
| if (!sanitized) { | ||
| return `tool_${Date.now()}_${Math.random().toString(36).slice(2, 7)}` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For generating unique identifiers, crypto.randomBytes is generally preferred over Math.random() as it provides cryptographically stronger random values, reducing the chance of collisions, however unlikely. Since this file uses CommonJS (module.exports), you can use require('crypto') inline.
| return `tool_${Date.now()}_${Math.random().toString(36).slice(2, 7)}` | |
| return `tool_${Date.now()}_${require('crypto').randomBytes(3).toString('hex')}`; |
- Add crypto import at the top of the file
- Replace inline require('crypto') with imported crypto module
- Restrict tool names to LLM-compatible characters [a-z0-9_-]
- Add 64 character limit for tool names
- Use crypto.randomBytes for stronger random fallback names
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a sanitizeToolName helper function to allow Unicode characters in document store tool names, fixing an issue where non-ASCII titles would result in an invalid empty tool name. The new function also adds a fallback name generation mechanism and enforces a 64-character limit. The repeated sanitization logic has been correctly refactored to use this new centralized function.
My main feedback is that the regular expression in sanitizeToolName does not actually allow Unicode characters as intended by this pull request. I've provided a suggestion to correct the regex to align with the PR's objective.
| const sanitized = name | ||
| .toLowerCase() | ||
| .replace(/ /g, '_') | ||
| .replace(/[^a-z0-9_-]/g, '') // Restrict to characters allowed by major LLMs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current regex /[^a-z0-9_-]/g does not align with the PR's goal to support Unicode characters in tool names, as it strips out any non-ASCII characters. This contradicts the PR description which states: 'Use Unicode-aware regex pattern [^\p{L}\p{N}_-] instead of [^a-z0-9_-]'.
To fix this and allow Unicode characters from languages like Korean, Chinese, and Japanese, the regex should be updated. The u flag is necessary for Unicode property escapes to work correctly.
I've also updated the comment to reflect the change, as the previous comment was contradictory to allowing Unicode characters.
| .replace(/[^a-z0-9_-]/g, '') // Restrict to characters allowed by major LLMs | |
| .replace(/[^\p{L}\p{N}_-]/gu, '') // Allow Unicode letters/numbers, underscore, and hyphen |
- Explicitly document ASCII-only restriction for LLM API compatibility - Clarify that non-ASCII titles use auto-generated fallback names - Mention the specific error being prevented
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a sanitizeToolName function to handle non-ASCII characters in tool names, preventing errors with LLM APIs. The new function correctly sanitizes names to be ASCII-only and generates a unique fallback name if the sanitization results in an empty string. The duplicated sanitization logic across the file has been refactored to use this new centralized function, which is a great improvement.
My feedback includes a suggestion to make the fallback name generation more robust by relying solely on cryptographic randomness instead of a timestamp, further reducing the chance of collisions.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
Hi @HenryHengZJ and @vinodkiran, Thank you for your work on Flowise. This PR addresses an issue related to the HTTP node in Agentflow v2 where certain responses fail to be handled properly under specific conditions. It includes improved handling for streams and empty response bodies, and adds safety checks to avoid runtime errors in common workflows. Could you please take a look when you have time? Thank you! |
| @@ -1,3 +1,4 @@ | |||
| import crypto from 'crypto' | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: import { randomBytes } from 'crypto'
|
|
||
| // If the result is empty (e.g., non-ASCII only input), generate a unique fallback name | ||
| if (!sanitized) { | ||
| return `tool_${crypto.randomBytes(8).toString('hex')}` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed you already used Math.random() below. I think it should be enough for the random name.
return `tool_${Date.now()}_${Math.random().toString(36).slice(2, 7)}`;
This can avoid importing the crypto library just for generating a random name, since it doesn't require cryptographic security.
It seems that Gemini proposed to use crypto.randomBytes. I'm open to either option.
This PR fixes the 'Invalid tools[0].function.name: empty string' error that occurs with non-ASCII document store titles. Instead of allowing Unicode characters (which would cause LLM API errors), it sanitizes tool names to ASCII-only and generates unique fallback names for non-ASCII titles.