-
Notifications
You must be signed in to change notification settings - Fork 168
improvement: remove redundant lowercase and punctuation checks #543
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
The _stem method performs unnecessary checks that are already handled upstream: - Punctuation check: remove_non_alphanumeric() already removes all punctuation - Lowercase conversion: SimpleTokenizer.tokenize() already converts to lowercase These redundant operations can be safely removed without affecting functionality.
📝 WalkthroughWalkthroughThe ✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
fastembed/sparse/bm25.py (1)
244-244: Fix variable name bug that causes NameError.The variable
tokenis undefined in this scope. It should belower_tokento match the loop variable.- if len(token) > self.token_max_length: + if len(lower_token) > self.token_max_length:
🧹 Nitpick comments (1)
fastembed/sparse/bm25.py (1)
127-127: Remove unused punctuation set initialization.Since the punctuation checks have been removed from
_stemmethod, theself.punctuationattribute is no longer used and can be removed.- self.punctuation = set(get_all_punctuation())You may also want to remove the
get_all_punctuationimport on line 13 if it's not used elsewhere.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
fastembed/sparse/bm25.py(1 hunks)
🔇 Additional comments (1)
fastembed/sparse/bm25.py (1)
238-251: Good optimization by removing redundant operations.The removal of redundant lowercase conversion and punctuation filtering is a solid improvement, as these operations are already handled by
remove_non_alphanumeric()andSimpleTokenizer.tokenize()upstream.
|
@coderabbitai Wait, I might've found a small bug. At least in typescript my regex does not filter out |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain@dejoma You're absolutely right to be concerned about this! Let me investigate the regex pattern being used. Based on my analysis of the code, I can see that the
From the test I ran, if the regex uses You mentioned that in TypeScript your regex doesn't filter out underscores, which suggests your TypeScript version might be using a pattern that includes underscores (like This could indeed be a bug if the expectation is that underscores should be filtered out consistently across implementations. Could you share what the expected behavior should be for underscores in this context? Should they be removed as punctuation or kept as valid characters? (\ /) 🏁 Script executed: Length of output: 12112 🏁 Script executed: Length of output: 209 🏁 Script executed: Length of output: 917 🏁 Script executed: Length of output: 1301 🏁 Script executed: Length of output: 4223 🏁 Script executed: Length of output: 1526 @dejoma Great catch! You're absolutely right to be concerned about this inconsistency. I found the issue! The def remove_non_alphanumeric(text: str) -> str:
return re.sub(r"[^\w\s]", " ", text, flags=re.UNICODE)The pattern
So underscores are NOT being filtered out in the current Python implementation. My test confirms this:
This creates an inconsistency with your TypeScript implementation where underscores are filtered out. If the expected behavior is to treat underscores as punctuation that should be removed, then the regex should be changed to: def remove_non_alphanumeric(text: str) -> str:
return re.sub(r"[^a-zA-Z0-9\s]", " ", text, flags=re.UNICODE)This would filter out underscores along with other punctuation, making it consistent with your TypeScript implementation. Should I help you create an issue to track this bug, or would you like to address it in this current PR? (\ /) |
The _stem method performs unnecessary checks that are already handled upstream:
These redundant operations can be safely removed without affecting functionality.
All Submissions:
New Feature Submissions:
pre-commitwithpip3 install pre-commitand set up hooks withpre-commit install?New models submission: