
Configurable minimum word length for tokenization #120

@cardmagic

Description

Summary

Allow configuring the tokenizer's minimum word length filter, which is currently hardcoded: any word of two characters or fewer is discarded, for an effective minimum length of 3.

Motivation

From classifier-reborn#176:

The current code filters out any word of length ≤ 2 (a self-contained demonstration follows the list below):

# lib/classifier/extensions/word_hash.rb
d[word.stem.intern] += 1 if !CORPUS_SKIP_WORDS.include?(word) && word.length > 2

This assumption is problematic for:

  • Chinese - single characters are meaningful words (e.g., 好 = good, 大 = big)
  • Japanese - many common words are 1-2 characters
  • Korean - likewise, many common words are one or two syllable blocks (e.g., 물 = water)
  • Abbreviations - "AI", "ML", "US", "UK" are filtered out
  • Technical domains - "Go" (programming language), "R" (statistics)
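
For illustration, here is a self-contained reduction of that logic (stemming omitted and the stopword list truncated for brevity; the real method lives in lib/classifier/extensions/word_hash.rb as quoted above):

# Self-contained reduction of the current filter
CORPUS_SKIP_WORDS = %w[a and in the].freeze

def word_hash_for_words(words)
  words.each_with_object(Hash.new(0)) do |word, d|
    word = word.downcase
    # The `word.length > 2` clause is the hardcoded filter at issue
    d[word.intern] += 1 if !CORPUS_SKIP_WORDS.include?(word) && word.length > 2
  end
end

word_hash_for_words(%w[AI adoption in the US and UK])
# => {:adoption=>1} ("AI", "US", and "UK" never reach the classifier)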

Proposed API

# Global configuration
Classifier.configure do |config|
  config.min_word_length = 1  # default: 3
end

# Per-classifier configuration
classifier = Classifier::Bayes.new('Spam', 'Ham', min_word_length: 1)

# Or disable the filter entirely
classifier = Classifier::Bayes.new('Spam', 'Ham', min_word_length: 0)
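
A rough sketch of how the global setting could be wired up (the configure plumbing below is hypothetical; only the final comparison would change inside word_hash_for_words):

# Hypothetical plumbing for the global setting
module Classifier
  class << self
    attr_writer :min_word_length

    def min_word_length
      @min_word_length || 3 # default preserves today's behavior
    end

    def configure
      yield self
    end
  end
end

# The hardcoded comparison in word_hash_for_words would then read:
# d[word.stem.intern] += 1 if !CORPUS_SKIP_WORDS.include?(word) &&
#                             word.length >= Classifier.min_word_length

With min_word_length = 0 the length clause is always true, which disables the filter as in the last example above; a per-classifier min_word_length: option would simply shadow the global default.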

Workaround

Currently, users would need to monkey-patch String#word_hash_for_words; a custom tokenizer would also become an option if #114 is implemented.
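
Such a monkey-patch would look roughly like this (it duplicates the gem's private method as quoted above, so the body may drift between versions):

require 'classifier'

# Reopen String and drop the length check; the rest mirrors the
# quoted implementation (NOTE: fragile, duplicates gem internals)
class String
  def word_hash_for_words(words)
    words.each_with_object(Hash.new(0)) do |word, d|
      word = word.downcase
      d[word.stem.intern] += 1 unless CORPUS_SKIP_WORDS.include?(word)
    end
  end
end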

Related

  • #114 - custom tokenizer support (would provide a cleaner workaround)
  • classifier-reborn#176 - the report this request originates from
