fix: split CJK from Latin regex tokens #479

Open
cat0825 wants to merge 3 commits into google:main from cat0825:fix-non-ascii-char-intervals
cat0825 commented May 31, 2026

Description

The default regex tokenizer merged CJK characters with adjacent Latin words into a single token, producing token boundaries that did not align with the source character intervals. This change splits CJK runs from Latin runs so each is emitted as its own word token while preserving exact source-character offsets.

  • Prevent the default regex tokenizer from merging CJK text with adjacent Latin words.
  • Keep CJK spans classified as word tokens while preserving exact source character intervals.
  • Add regression coverage for Latin/CJK and accented-Latin/CJK boundaries.

Fixes #334

Bug fix

How Has This Been Tested?

$ python -m pytest tests/tokenizer_test.py -q
$ python -m pytest tests/chunking_test.py tests/resolver_test.py -q
$ python -m pyink langextract/core/tokenizer.py tests/tokenizer_test.py
$ git diff --check

Also ran the full suite with python -m pytest -q. Collection fails in this local environment because google.genai is not importable, which blocks the Gemini provider tests (tests/annotation_test.py, tests/gemini_retry_test.py, tests/inference_test.py, tests/provider_schema_test.py); those failures are unrelated to this change.

Checklist:

  • I have read and acknowledged Google's Open Source Code of conduct.
  • I have read the Contributing page, and I either signed the Google Individual CLA or am covered by my company's Corporate CLA.
  • I have discussed my proposed solution with code owners in the linked issue(s) and we have agreed upon the general approach.
  • I have made any needed documentation changes, or noted in the linked issue(s) that documentation elsewhere needs updating.
  • I have added tests, or I have ensured existing tests cover the changes.
  • I have followed Google's Python Style Guide and ran pylint over the affected code.

github-actions Bot added the size/S (Pull request with 50-150 lines changed) label May 31, 2026

google-cla Bot commented May 31, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Labels

size/S Pull request with 50-150 lines changed

Development

Successfully merging this pull request may close these issues.

Bug / unexpected behavior: char_interval wrong for non-ASCII text when using certain providers in v1.1.1