@danipen danipen commented Dec 11, 2025

Summary

This PR introduces ReadOnlyMemory<char> support throughout the tokenization pipeline, enabling zero-allocation text handling. Combined with several allocation reduction optimizations, this delivers significant performance and memory improvements.

What's New

New LineText Type

A new LineText struct wraps ReadOnlyMemory<char>, providing a clean API for text handling without string allocations:

// Before: Always allocated strings
string lineText = model.GetLineText(lineIndex);
grammar.TokenizeLine(lineText, ruleStack, timeout);

// After: Zero-copy memory access
LineText lineText = model.GetLineText(lineIndex);
grammar.TokenizeLine(lineText, ruleStack, timeout);

Updated Public APIs

  • IGrammar.TokenizeLine() and TokenizeLine2() now accept LineText instead of string
  • IModelLines.GetLineText() returns LineText instead of string
  • Implicit conversions from string to LineText maintain backward compatibility
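To make the conversion behavior concrete, here is a minimal sketch of what a `LineText`-style wrapper could look like. Only the names `LineText` and `ReadOnlyMemory<char>` come from the PR; the fields and members below are illustrative assumptions, not the PR's exact implementation:

```csharp
using System;

// Minimal sketch of a LineText-style wrapper over ReadOnlyMemory<char>.
// Member names other than LineText itself are assumptions.
public readonly struct LineText
{
    private readonly ReadOnlyMemory<char> _text;

    public LineText(ReadOnlyMemory<char> text) => _text = text;

    public int Length => _text.Length;
    public ReadOnlyMemory<char> Memory => _text;

    // Implicit conversions keep existing string-based call sites compiling.
    public static implicit operator LineText(string s) => new LineText(s.AsMemory());
    public static implicit operator LineText(ReadOnlyMemory<char> m) => new LineText(m);

    public override string ToString() => _text.ToString();
}
```

With conversions like these, a call site such as `grammar.TokenizeLine("int x = 1;", ruleStack, timeout)` continues to compile even though the parameter type changed from `string` to `LineText`.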

Benchmark Results

Tested with a 133,439-line C# file (5.8 MB):

Metric             master     This PR    Improvement
Execution Time     4.752 s    2.681 s    44% faster
Memory Allocated   658.27 MB  496.36 MB  25% less
Gen0 Collections   82,000     62,000     24% fewer
Gen1 Collections   8,000      4,000      50% fewer

Optimizations Applied

  1. ArrayPool for line buffers - Reuse char arrays instead of allocating per line
  2. Allocation-free timing - Use Stopwatch.GetTimestamp() instead of new Stopwatch()
  3. List pooling - Reuse internal lists in hot paths (HandleCaptures, CheckWhileConditions)
  4. Scope name caching - Cache GetScopeNames() result to avoid repeated list creation
  5. Single-scope optimization - Avoid List<string> allocation for single scope pushes
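Optimization 2 (allocation-free timing) can be illustrated with raw timestamps. This is a sketch under assumptions: the helper name `TimeoutHelper.HasTimedOut` is invented here for illustration; the PR's actual code lives in `LineTokenizer.Scan()`:

```csharp
using System;
using System.Diagnostics;

// Allocation-free timeout check: compare raw timestamps instead of
// allocating a new Stopwatch for every tokenized line.
static class TimeoutHelper
{
    public static bool HasTimedOut(long startTimestamp, TimeSpan timeout)
    {
        // Stopwatch.GetTimestamp() returns ticks of Stopwatch.Frequency per second.
        long elapsedTicks = Stopwatch.GetTimestamp() - startTimestamp;
        double elapsedMs = elapsedTicks * 1000.0 / Stopwatch.Frequency;
        return elapsedMs > timeout.TotalMilliseconds;
    }
}
```

A hot loop then records `long start = Stopwatch.GetTimestamp();` once and calls `TimeoutHelper.HasTimedOut(start, timeout)` per iteration, with no per-line heap allocation.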

⚠️ Breaking Changes

  • IModelLines.GetLineText() now returns LineText instead of string
  • IGrammar.TokenizeLine() signature changed to accept LineText

Migration: Replace string with LineText in interface implementations. Thanks to the implicit conversion from string, most call sites compile unchanged.
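As a sketch of the migration, here is a hypothetical implementer of `IModelLines`. The class name `InMemoryModelLines` and the minimal `LineText` below are assumptions for illustration; only the declared return type of `GetLineText` needs to change:

```csharp
using System;
using System.Collections.Generic;

// Minimal stand-in for the PR's LineText (an assumption for this sketch).
public readonly struct LineText
{
    private readonly ReadOnlyMemory<char> _text;
    public LineText(ReadOnlyMemory<char> text) => _text = text;
    public static implicit operator LineText(string s) => new LineText(s.AsMemory());
    public override string ToString() => _text.ToString();
}

// Hypothetical IModelLines implementer, before and after migration.
public class InMemoryModelLines
{
    private readonly List<string> _lines = new() { "using System;", "class C { }" };

    // Before: public string GetLineText(int lineIndex) => _lines[lineIndex];
    // After:  only the return type changes; the body compiles unchanged
    //         because string converts implicitly to LineText.
    public LineText GetLineText(int lineIndex) => _lines[lineIndex];
}
```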

Testing

  • New benchmark project for performance validation

LineText can be implicitly converted from string and from ReadOnlyMemory<char>.
Optimizations applied:

- Use ArrayPool<char> instead of allocating new char[] per line in Grammar.Tokenize()
- Replace new Stopwatch() with Stopwatch.GetTimestamp() in LineTokenizer.Scan()
- Pool List<LocalStackElement> and List<WhileStack> in LineTokenizer
- Cache GetScopeNames() result in AttributedScopeStack
- Avoid List<string> allocation for single-scope PushAtributed()

Benchmark results (133K line file):
- Execution time: 4.75s → 2.89s (39% faster)
- Memory allocated: 658 MB → 488 MB (26% less)
- Gen0 collections: 82K → 61K (26% fewer)
- Gen1 collections: 8K → 4K (50% fewer)
…ilures

The ArrayPool<char> optimization in Grammar.Tokenize() caused test
failures on x64 Linux/Windows CI while passing on ARM64 macOS locally.

Root cause: The rented buffer was returned to the pool in the finally
block while LineTokens still held a ReadOnlyMemory<char> reference to
it. On x64 platforms with aggressive buffer reuse, subsequent tokenize
calls would reuse and overwrite the buffer, corrupting previous results.
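The lifetime bug described above can be reproduced in isolation. This is a simplified sketch, not the PR's actual code: the shared pool often (but is not guaranteed to) hand back the same array on the next rent of a similar size, which is why the corruption surfaced only on some platforms:

```csharp
using System;
using System.Buffers;

// Sketch of the bug: a pooled buffer is returned to the pool while a
// ReadOnlyMemory<char> slice of it is still alive.
char[] buffer = ArrayPool<char>.Shared.Rent(16);
"first line".AsSpan().CopyTo(buffer);

// This slice "escapes" the tokenize call, like the PR's LineTokens did.
ReadOnlyMemory<char> tokens = buffer.AsMemory(0, 10);

ArrayPool<char>.Shared.Return(buffer);            // bug: tokens still refers to buffer

char[] reused = ArrayPool<char>.Shared.Rent(16);  // may hand back the same array
"overwrite!".AsSpan().CopyTo(reused);

// tokens may now read "overwrite!" instead of "first line" — whether it
// does depends on the pool's reuse behavior, hence the platform-specific
// test failures.
```

The safe patterns are either to copy the tokens out before returning the buffer, or to keep the buffer alive for as long as any `ReadOnlyMemory<char>` slice of it can be observed.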

The other performance optimizations from commit 0c1c0aa remain intact:
- Stopwatch.GetTimestamp() instead of new Stopwatch()
- Pooled List<LocalStackElement> and List<WhileStack> in LineTokenizer
- Cached GetScopeNames() in AttributedScopeStack
- Single-scope PushAtributed() optimization