Labels: area: core (Core functionality affecting all classifiers), enhancement (New feature or request), internationalization (International language support), priority: high (High priority)
Summary
Allow users to provide a custom tokenizer for text processing instead of using the built-in String#word_hash method.
Motivation
From classifier-reborn#131:
The current tokenization is hardcoded:
str.gsub(/[^\w\s]/, '').downcase.split
This doesn't work well for:
- CJK languages (Chinese, Japanese, Korean), which require specialized tokenizers such as TinySegmenter
- N-gram based classification, where phrases like "New York" are split apart and "New" may then be filtered out as a stopword (see the bigram sketch after this list)
- Domain-specific text, such as medical, legal, or technical documents that need custom tokenization rules
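To illustrate the n-gram case, a user-supplied tokenizer could emit overlapping bigrams so that "New York" survives as a single token. This is only a sketch of what such a tokenizer might look like, not part of the proposed API:
# Naive bigram tokenizer: keeps adjacent word pairs together so that
# phrases like "New York" are preserved as "new york".
bigram_tokenizer = lambda do |text|
  words = text.downcase.scan(/\w+/)
  words.each_cons(2).map { |pair| pair.join(' ') } + words
end

bigram_tokenizer.call('He moved to New York City')
# => ["he moved", "moved to", "to new", "new york", "york city",
#     "he", "moved", "to", "new", "york", "city"]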
Proposed API
# Lambda-based tokenizer
classifier = Classifier::Bayes.new('Spam', 'Ham',
  tokenizer: ->(text) { MySegmenter.segment(text) }
)
# Or a tokenizer class
class JapaneseTokenizer
  def tokenize(text)
    TinySegmenter.new.segment(text)
  end
end

classifier = Classifier::Bayes.new('Spam', 'Ham',
  tokenizer: JapaneseTokenizer.new
)
Affected Classes
- Classifier::Bayes
- Classifier::LSI
- Classifier::TFIDF
- Classifier::LogisticRegression
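One way these classes could normalize the two accepted forms (a callable or an object responding to tokenize) is a small shared mixin. This is a sketch under the assumptions of the proposed API; the module, method, and variable names below (Tokenizing, setup_tokenizer, @tokenizer) are hypothetical, not existing library API:
# Hypothetical shared helper showing how a classifier might wrap the
# user-supplied tokenizer and fall back to the current built-in behavior.
module Classifier
  module Tokenizing
    def setup_tokenizer(tokenizer = nil)
      @tokenizer = tokenizer
    end

    # Accepts either a callable (lambda/proc) or an object with #tokenize.
    def tokenize(text)
      if @tokenizer.respond_to?(:call)
        @tokenizer.call(text)
      elsif @tokenizer.respond_to?(:tokenize)
        @tokenizer.tokenize(text)
      else
        text.gsub(/[^\w\s]/, '').downcase.split
      end
    end
  end
end
Keeping the dispatch in one place would let Bayes, LSI, TFIDF, and LogisticRegression share identical tokenizer handling.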
Related
- jekyll/classifier-reborn#131: Ability to specify a custom tokenizer
- jekyll/classifier-reborn#176: Chinese word length; in languages such as Chinese, words no longer than two characters are very common, so the built-in minimum word length assumption is too strong