Add Lark.scan() for finding grammar matches in text, and parsing them.#1592

Open
erezsh wants to merge 11 commits into master from scan_parse

erezsh commented May 10, 2026

Member

It finds and parses each non-overlapping match, yielding the longest possible match.

Reimplementation of the long-standing #1429 PR by @MegaIng on top of the merged TextSlice support, with some modifications.

It works in two steps. First, it parses without callbacks, which keeps cloning cheap. Then it replays the chosen match with the user's callbacks.
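
To make the two-step idea concrete, here is a minimal, self-contained sketch. It is not the code in this PR and does not use Lark: a greedy regex (`NUMBER`) stands in for the grammar, and the names `find_spans` and `scan` are hypothetical. It only illustrates the shape described above: locate non-overlapping matches first without running any user code, then replay each chosen span through a callback.

```python
import re
from typing import Callable, Iterator, List, Tuple

# Toy "grammar": a decimal number. A greedy regex stands in for a real parser
# (assumption for illustration only; the PR works on an LALR parser).
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def find_spans(text: str) -> Iterator[Tuple[int, int]]:
    """Step 1: locate non-overlapping matches, left to right, without callbacks."""
    pos = 0
    while pos < len(text):
        m = NUMBER.match(text, pos)
        if m:
            yield m.start(), m.end()
            pos = m.end()  # resume after the match, so matches never overlap
        else:
            pos += 1  # nothing starts here; try the next position


def scan(text: str, callback: Callable[[str], object]) -> List[Tuple[int, int, object]]:
    """Step 2: replay each chosen span through the user callback."""
    return [(start, end, callback(text[start:end])) for start, end in find_spans(text)]


results = scan("a 12 bb 3.5 c", float)
print(results)
```

Keeping the callback out of the first step means the search can try and discard candidate matches without running user code; only the span that is finally chosen pays for it.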

codecov Bot commented May 10, 2026

Codecov Report

❌ Patch coverage is 98.78049% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.41%. Comparing base (c169b26) to head (0978b72).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| lark/parser_frontends.py | 96.82% | 2 Missing ⚠️ |
| lark/lexer.py | 97.91% | 1 Missing ⚠️ |
| tests/test_scan.py | 99.47% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1592      +/-   ##
==========================================
+ Coverage   90.08%   90.41%   +0.33%     
==========================================
  Files          52       53       +1     
  Lines        8105     8422     +317     
==========================================
+ Hits         7301     7615     +314     
- Misses        804      807       +3     

| Flag | Coverage Δ |
|---|---|
| unittests | 90.41% <98.78%> (+0.33%) ⬆️ |

erezsh requested a review from MegaIng May 10, 2026 14:29

MegaIng left a comment

Member

Looks very good. Thank you for taking over this implementation, and sorry for never getting around to it.

I genuinely don't have any detailed comments that aren't unrelated nits; this is very clean.

erezsh commented May 10, 2026

Member Author

No worries, thanks for looking it over!

erezsh added 3 commits June 25, 2026 17:11
…uce__

Both methods only copied start_pos/line/column, silently dropping the end
positions.
It finds and parses each non-overlapping match, yielding the longest possible match.

Reimplementation of the long-standing #1429 PR on top of the merged
TextSlice support, with some modifications.
erezsh added 6 commits June 25, 2026 23:01
- Split ParsingFrontend.scan() into _scan() so configuration errors raise on the call
- Lark.scan() now checks parser='lalr' up front too. Reject custom lexer
- Re-raise ConfigurationError instead of counting it as ValueError
- Check source positions on every replayed token, not just the last
docstring.
- Improves docs
- Adds tests
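
The first and third items in the list above concern when a configuration error surfaces. A rough sketch of the general Python pattern follows, with hypothetical names (`scan_lazy`, `scan_eager`, `_scan`, and a local `ConfigurationError` stand-in that is not imported from Lark). When a check sits inside a generator function, it runs only on the first `next()`. Moving the generator body into a private helper lets the public function validate at call time.

```python
from typing import Iterator


class ConfigurationError(ValueError):
    """Hypothetical stand-in for illustration; assumed here to subclass ValueError."""


def _scan(text: str) -> Iterator[str]:
    # Generator body: runs only once the caller starts iterating.
    for word in text.split():
        yield word


def scan_lazy(text: str, parser: str) -> Iterator[str]:
    # This function contains `yield`, so the check below is deferred
    # until the first next() on the returned generator.
    if parser != "lalr":
        raise ConfigurationError("scan requires parser='lalr'")
    yield from _scan(text)


def scan_eager(text: str, parser: str) -> Iterator[str]:
    # Plain function: the check runs at call time, then the generator is returned.
    if parser != "lalr":
        raise ConfigurationError("scan requires parser='lalr'")
    return _scan(text)


lazy = scan_lazy("a b", parser="earley")  # no error yet
try:
    scan_eager("a b", parser="earley")  # raises immediately
except ConfigurationError as exc:
    print("raised on the call:", exc)
```

Because the stand-in error subclasses `ValueError` in this sketch, a broad `except ValueError:` around the scanning loop would also swallow it; catching and re-raising the configuration error first avoids treating a setup mistake as an ordinary failed match.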

erezsh commented Jun 25, 2026

Member Author

Added a few fixes.

I think it's now ready to merge.

The most important ones are the scan-related fixes, especially the last two: they fix performance issues and a small bug around %ignore.

@MegaIng You're welcome to review the changes if you like.
