
Configurable minimum word length for tokenization #120

@cardmagic

Description

Summary

Allow configuring the tokenizer's minimum word length filter, which is currently hardcoded: any word of two characters or fewer is discarded, for an effective minimum length of 3.

Motivation

From classifier-reborn#176:

The current code filters out any word of length ≤ 2 (a self-contained demonstration follows the list below):

# lib/classifier/extensions/word_hash.rb
d[word.stem.intern] += 1 if !CORPUS_SKIP_WORDS.include?(word) && word.length > 2

This assumption is problematic for:

  • Chinese - single characters are meaningful words (e.g., 好 = good, 大 = big)
  • Japanese - many common words are 1-2 characters
  • Korean - likewise, many common words are one or two syllable blocks (e.g., 물 = water)
  • Abbreviations - "AI", "ML", "US", "UK" are filtered out
  • Technical domains - "Go" (programming language), "R" (statistics)
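
For illustration, here is a self-contained reduction of that logic (stemming omitted and the stopword list truncated for brevity; the real method lives in lib/classifier/extensions/word_hash.rb as quoted above):

# Self-contained reduction of the current filter
CORPUS_SKIP_WORDS = %w[a and in the].freeze

def word_hash_for_words(words)
  words.each_with_object(Hash.new(0)) do |word, d|
    word = word.downcase
    # The `word.length > 2` clause is the hardcoded filter at issue
    d[word.intern] += 1 if !CORPUS_SKIP_WORDS.include?(word) && word.length > 2
  end
end

word_hash_for_words(%w[AI adoption in the US and UK])
# => {:adoption=>1} ("AI", "US", and "UK" never reach the classifier)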

Proposed API

# Global configuration
Classifier.configure do |config|
  config.min_word_length = 1  # default: 3
end

# Per-classifier configuration
classifier = Classifier::Bayes.new('Spam', 'Ham', min_word_length: 1)

# Or disable the filter entirely
classifier = Classifier::Bayes.new('Spam', 'Ham', min_word_length: 0)
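
A rough sketch of how the global setting could be wired up (the configure plumbing below is hypothetical; only the final comparison would change inside word_hash_for_words):

# Hypothetical plumbing for the global setting
module Classifier
  class << self
    attr_writer :min_word_length

    def min_word_length
      @min_word_length || 3 # default preserves today's behavior
    end

    def configure
      yield self
    end
  end
end

# The hardcoded comparison in word_hash_for_words would then read:
# d[word.stem.intern] += 1 if !CORPUS_SKIP_WORDS.include?(word) &&
#                             word.length >= Classifier.min_word_length

With min_word_length = 0 the length clause is always true, which disables the filter as in the last example above; a per-classifier min_word_length: option would simply shadow the global default.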

Workaround

Currently, users would need to monkey-patch String#word_hash_for_words; a custom tokenizer would also become an option if #114 is implemented.
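
Such a monkey-patch would look roughly like this (it duplicates the gem's private method as quoted above, so the body may drift between versions):

require 'classifier'

# Reopen String and drop the length check; the rest mirrors the
# quoted implementation (NOTE: fragile, duplicates gem internals)
class String
  def word_hash_for_words(words)
    words.each_with_object(Hash.new(0)) do |word, d|
      word = word.downcase
      d[word.stem.intern] += 1 unless CORPUS_SKIP_WORDS.include?(word)
    end
  end
end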

Related

  • #114 - custom tokenizer support (would provide a cleaner workaround)
  • classifier-reborn#176 - the report this request originates from
