
Add custom tokenizer support #118

@cardmagic

Summary

Allow users to provide a custom tokenizer for text processing instead of using the built-in String#word_hash method.

Motivation

From classifier-reborn#131:

The current tokenization is hardcoded:

str.gsub(/[^\w\s]/, '').downcase.split

This doesn't work well for:

  • CJK languages (Chinese, Japanese, Korean) - require specialized tokenizers like TinySegmenter
  • N-gram based classification - phrases like "New York" get split into separate tokens (see the example below), and "New" may then be filtered out as a stopword
  • Domain-specific text - medical, legal, or technical text may need custom tokenization rules
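
For illustration, this is roughly what the hardcoded pipeline produces today (a quick sketch of the snippet above; the stopword filtering step is not shown here):

text = "Flights to New York"
tokens = text.gsub(/[^\w\s]/, '').downcase.split
# => ["flights", "to", "new", "york"]
# "New York" is no longer a single token, and "new" can be dropped
# later if it appears in the stopword list.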

Proposed API

# Lambda-based tokenizer
classifier = Classifier::Bayes.new('Spam', 'Ham', 
  tokenizer: ->(text) { MySegmenter.segment(text) }
)

# Or a tokenizer class
class JapaneseTokenizer
  def tokenize(text)
    TinySegmenter.new.segment(text)
  end
end

classifier = Classifier::Bayes.new('Spam', 'Ham',
  tokenizer: JapaneseTokenizer.new
)
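
One way the option could be resolved internally is to accept either a callable or any object responding to tokenize, and fall back to the current behaviour when nothing is passed. This is only a sketch; the Tokenizing module, DEFAULT_TOKENIZER constant, and resolve_tokenizer method below are hypothetical names, not existing gem API.

module Classifier
  module Tokenizing
    # Mirrors the current hardcoded behaviour as the default.
    DEFAULT_TOKENIZER = ->(text) { text.gsub(/[^\w\s]/, '').downcase.split }

    # Accepts nil, a lambda/proc, or an object with a #tokenize method.
    def resolve_tokenizer(option)
      return DEFAULT_TOKENIZER if option.nil?
      return option if option.respond_to?(:call)
      return option.method(:tokenize) if option.respond_to?(:tokenize)
      raise ArgumentError, 'tokenizer must respond to #call or #tokenize'
    end
  end
end

# A constructor could then store the resolved tokenizer once:
#   @tokenizer = resolve_tokenizer(options[:tokenizer])
# and every place that currently builds String#word_hash would
# tokenize via @tokenizer.call(text) instead.

Normalizing both forms up front would keep the lambda-based and class-based styles from the proposed API interchangeable across the affected classes.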

Affected Classes

  • Classifier::Bayes
  • Classifier::LSI
  • Classifier::TFIDF
  • Classifier::LogisticRegression
