Labels
- area: core (Core functionality affecting all classifiers)
- enhancement (New feature or request)
- good first issue (Good for newcomers)
- internationalization (International language support)
- priority: medium (Medium priority)
Description
Summary
Allow configuring the minimum word length used during tokenization. The threshold is currently hardcoded: words of 2 characters or fewer are dropped, so the effective minimum length is 3.
Motivation
From classifier-reborn#176:
The current code filters words with length ≤ 2:
    # lib/classifier/extensions/word_hash.rb
    d[word.stem.intern] += 1 if !CORPUS_SKIP_WORDS.include?(word) && word.length > 2
This assumption is problematic for:
- Chinese - single characters are meaningful words (e.g., 好 = good, 大 = big)
- Japanese - many common words are 1-2 characters
- Korean - many common words are likewise 1-2 characters
- Abbreviations - "AI", "ML", "US", "UK" are filtered out
- Technical domains - "Go" (programming language), "R" (statistics)
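The cases above can be reproduced with a small stand-in for the length check. This is only a sketch: `count_words` and its `min_length` parameter are hypothetical names, and stemming and the `CORPUS_SKIP_WORDS` lookup are left out so that only the length filter is visible.

```ruby
# Simplified stand-in for the length check in word_hash.rb.
# Assumption: no stemming and no stop-word list, only the length filter.
def count_words(words, min_length: 3)
  counts = Hash.new(0)
  words.each do |word|
    # `word.length > 2` in the current code is the same as `>= 3`
    counts[word] += 1 if word.length >= min_length
  end
  counts
end

tokens = %w[AI is good 好 Go R language]

# Current behaviour: everything shorter than 3 characters is dropped
default_counts = count_words(tokens)

# With the threshold lowered to 1, every token survives
relaxed_counts = count_words(tokens, min_length: 1)

puts default_counts.keys.inspect
puts relaxed_counts.keys.inspect
```

With the default threshold only "good" and "language" are counted; "AI", "好", "Go" and "R" never reach the classifier.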
Proposed API
    # Global configuration
    Classifier.configure do |config|
      config.min_word_length = 1 # default: 3
    end

    # Per-classifier configuration
    classifier = Classifier::Bayes.new('Spam', 'Ham', min_word_length: 1)

    # Or disable the filter entirely
    classifier = Classifier::Bayes.new('Spam', 'Ham', min_word_length: 0)
Workaround
Currently, users would need to monkey-patch String#word_hash_for_words or use a custom tokenizer (if #114 is implemented).
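For comparison with the workaround, here is a minimal sketch of how the proposed option could be resolved: a global default that a per-classifier keyword argument overrides. Everything in it is an assumption made for illustration. `MinLengthSketch` is a hypothetical module, not the gem's code, and its `word_hash` splits on whitespace and skips stemming and stop words.

```ruby
# Hypothetical sketch of the proposed API, not the gem's implementation.
module MinLengthSketch
  # Holds the global default (3 matches the current `length > 2` check)
  class Config
    attr_accessor :min_word_length

    def initialize
      @min_word_length = 3
    end
  end

  def self.config
    @config ||= Config.new
  end

  def self.configure
    yield config
  end

  class Bayes
    attr_reader :min_word_length

    def initialize(*categories, min_word_length: nil)
      @categories = categories
      # nil means "use the global default"; 0 is truthy in Ruby, so an
      # explicit 0 is kept and disables the filter
      @min_word_length = min_word_length || MinLengthSketch.config.min_word_length
    end

    # Assumption: whitespace tokenization, no stemming, no stop words
    def word_hash(text)
      counts = Hash.new(0)
      text.split.each do |word|
        counts[word.downcase.to_sym] += 1 if word.length >= min_word_length
      end
      counts
    end
  end
end

default_clf = MinLengthSketch::Bayes.new('Spam', 'Ham')
relaxed_clf = MinLengthSketch::Bayes.new('Spam', 'Ham', min_word_length: 1)

puts default_clf.word_hash('AI is big').inspect
puts relaxed_clf.word_hash('AI is big').inspect
```

Resolving the value once in the constructor keeps the tokenizer free of global lookups, and classifiers created before a later `configure` call keep the threshold they were built with.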
Related
- jekyll/classifier-reborn#176: "In some languages like Chinese, a word of length not bigger than 2 is very common, so I suppose this is a very strong (sometimes wrong in other languages) assumption."
- Potentially addressed by custom tokenizer support