@joshweir

Fix #6

Created a new splittable class, PRE_N_POST_ONLY, which holds characters that can be both prefixes and suffixes but are split off only at the beginning or end of a token. The one exception: such a character still counts as a splittable when it is itself prefixed or suffixed by other splittables.
Taking the single quote ' as a PRE_N_POST_ONLY splittable, the following are valid uses as a splittable:

  • 'test quotes'
  • 'test quotes'. <- suffixed by another splittable
  • ('test quotes'). <- prefixed and suffixed by another splittable

The following would not be valid uses as a splittable:

  • l'interrelation
  • l'imagerie
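
The rule above can be sketched roughly as follows. This is an illustrative Python sketch, not the code from this PR: the names `split_token` and `tokenize`, and the contents of both character sets, are assumptions made for the example. It only peels splittables off the edges of a whitespace-delimited chunk and leaves interior characters alone, which is enough to show why `'test quotes'` is split while `l'imagerie` stays whole.

```python
# Hypothetical character sets; the real tokenizer's sets may differ.
PRE_N_POST_ONLY = {"'", '"'}
OTHER_SPLITTABLES = set(".,;:!?()[]{}")


def split_token(token):
    """Peel splittable characters off both edges of one chunk.

    A PRE_N_POST_ONLY character is split off only when it sits at the
    edge, or when everything between it and the edge is also splittable.
    """
    splittables = PRE_N_POST_ONLY | OTHER_SPLITTABLES
    start, end = 0, len(token)

    # Collect leading splittables, left to right.
    prefix = []
    while start < end and token[start] in splittables:
        prefix.append(token[start])
        start += 1

    # Collect trailing splittables, right to left, then restore order.
    suffix = []
    while end > start and token[end - 1] in splittables:
        suffix.append(token[end - 1])
        end -= 1
    suffix.reverse()

    core = token[start:end]  # interior quotes stay inside the core
    return prefix + ([core] if core else []) + suffix


def tokenize(text):
    """Split on whitespace, then peel edge splittables from each chunk."""
    tokens = []
    for chunk in text.split():
        tokens.extend(split_token(chunk))
    return tokens


print(tokenize("('test quotes')."))
print(tokenize("l'interrelation l'imagerie"))
```

In this sketch the French elisions survive as single tokens because the quote is never at an edge of its chunk, while the quoted phrase loses its quotes even when they are wrapped in parentheses or followed by a period.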

…eg. https://www.google.com, google.com, etc

fix hardcoding of tokenizer path in test_tokenize_urls test

refactor tokenizer

fix bug when url contains directories the entire url would not be a single token

refactor lib/tokenizer
# The first commit's message is:
recognize a complete url as a token, this includes various url forms eg. https://www.google.com, google.com, etc

# This is the 2nd commit message:

fix hardcoding of tokenizer path in test_tokenize_urls test

# This is the 3rd commit message:

refactor tokenizer

# This is the 4th commit message:

fix bug when url contains directories the entire url would not be a single token
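
As a rough illustration of what the URL commits describe, the check below treats a chunk as one URL token when it has an optional scheme, a dotted host name, and an optional path. It is a minimal Python sketch under assumptions: the pattern `URL_RE` and the helper `is_url` are hypothetical names, and the regular expression is deliberately simplified rather than taken from this PR.

```python
import re

# Simplified, hypothetical pattern: optional http(s) scheme, one or more
# dot-separated host labels, a top-level domain of two or more letters,
# and an optional path (so URLs with directories stay in one piece).
URL_RE = re.compile(
    r"^(?:https?://)?(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}(?:/\S*)?$"
)


def is_url(chunk):
    """Return True when the whole chunk looks like a URL."""
    return URL_RE.match(chunk) is not None


for candidate in ("https://www.google.com", "google.com",
                  "google.com/a/b/index.html", "l'imagerie"):
    print(candidate, is_url(candidate))
```

A tokenizer could run a check like this on each whitespace-delimited chunk before applying any splittable rules, so that the dots and slashes inside a URL are never used as split points.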

Successfully merging this pull request may close these issues.

french words that contains single quote get broken down
