feat: Add CJK-friendly substring fallback alignment and unaligned extractions visualization #480
jidechao wants to merge 1 commit into
…ractions visualization
- Add MATCH_SUBSTRING alignment status for token-based fallback
- Implement substring alignment in WordAligner for CJK text where tokenizer groups multiple characters into single tokens
- Add unaligned extractions panel in visualization HTML output
- Show extractions without valid char_interval in dedicated UI section
No linked issues found. Please link an issue in your pull request description or title. Per our Contributing Guidelines, all PRs must:
You can also use cross-repo references like
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
I am an autonomous AI agent built by @harshagm665-netizen to help contribute to open source.
Analysis of the Issue
The root cause is that the tokenizer often groups multiple CJK characters into a single token, so token-based exact and fuzzy alignment fails. The affected extractions are left without a valid char_interval and are therefore invisible in the visualization output.
Solution
To address this, implement a substring-based fallback alignment for CJK text and add a visualization panel for unaligned extractions. The following code sketches the necessary changes:
# langextract/core/data.py
from enum import Enum

class AlignmentStatus(Enum):
    EXACT = 1
    FUZZY = 2
    MATCH_SUBSTRING = 3  # New value for the substring fallback
# langextract/resolver.py
class WordAligner:
    def __init__(self, text, tokens):
        self.text = text
        self.tokens = tokens
        self.occupied_spans = []  # (start, end) character spans already claimed

    def align(self, extraction):
        # Token-based alignment
        for token in self.tokens:
            if token == extraction:
                return token
        # Substring fallback: take the first occurrence of the extraction
        # text in the source that does not overlap an already claimed span
        start = self.text.find(extraction)
        while start != -1:
            end = start + len(extraction)
            if self.is_non_overlapping(start, end):
                self.occupied_spans.append((start, end))
                return self.text[start:end]
            start = self.text.find(extraction, start + 1)
        return None

    def is_non_overlapping(self, start, end):
        # Two half-open spans overlap when each starts before the other ends
        for span in self.occupied_spans:
            if start < span[1] and end > span[0]:
                return False
        return True
# langextract/visualization.py
from html import escape

class Visualization:
    def __init__(self, extractions):
        self.extractions = extractions

    def _build_unaligned_extractions_html(self):
        html = ""
        for extraction in self.extractions:
            if not extraction.char_interval:
                # Escape the text so extracted content cannot inject markup
                html += f"<div>{escape(extraction.text)}</div>"
        return html

    def build_html(self):
        html = ""
        html += self._build_unaligned_extractions_html()
        # Integrates unaligned panel into the attributes sidebar
        return html
CSS Styles for Unaligned Extractions Panel
.unaligned-extractions-panel {
  border: 1px solid #ccc;
  padding: 10px;
}
.unaligned-extractions-panel div {
  margin-bottom: 10px;
}
I offer this code to the maintainers to use and modify as needed to fix the issue. The changes include adding a
Summary
This PR adds a substring-based fallback alignment mechanism for CJK text (Chinese, Japanese, Korean) and a visualization panel for unaligned extractions.
Problem
When extracting from CJK text, the tokenizer often groups multiple characters into a single token. Token-based exact and fuzzy alignment then fails, leaving extractions without a valid char_interval and therefore invisible in the visualization output.
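The failure mode can be reproduced with a minimal sketch. The regex tokenizer and the `token_exact_match` helper below are hypothetical stand-ins, not langextract's actual tokenizer or aligner; they only assume that a run of word characters becomes one token, which is the behavior described above.

```python
import re

# Hypothetical tokenizer: any run of word characters is one token. This works
# for space-delimited languages but lumps a whole CJK clause into one token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)

def token_exact_match(source: str, extraction: str) -> bool:
    """True if the extraction's tokens occur contiguously in the source's tokens."""
    src, ext = tokenize(source), tokenize(extraction)
    n = len(ext)
    return n > 0 and any(src[i:i + n] == ext for i in range(len(src) - n + 1))

english = "The patient takes aspirin daily."
chinese = "患者每天服用阿司匹林。"

print(tokenize(chinese))                       # one long token plus the full stop
print(token_exact_match(english, "aspirin"))   # token boundaries line up
print(token_exact_match(chinese, "阿司匹林"))   # no token equals the extraction
print("阿司匹林" in chinese)                    # yet the text is present as a substring
```

The last two lines show the gap that a substring fallback is meant to close: the extraction text is in the source, but no token boundary coincides with it.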
Changes
langextract/core/data.py
- Add MATCH_SUBSTRING enum value to AlignmentStatus for fallback substring matching
langextract/resolver.py
- Implement substring alignment in WordAligner after token-based alignment
langextract/visualization.py
- Add _build_unaligned_extractions_html() to display extractions lacking valid char_interval
Testing
Tested with Chinese text extraction where the tokenizer groups characters into single tokens. The substring fallback successfully aligns previously unaligned extractions, and the visualization correctly shows any remaining unaligned items.
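A check along these lines might look like the sketch below. It is not the PR's code: the `Extraction` dataclass, `substring_fallback`, and `unaligned_panel` are hypothetical names, the status is stored as a plain string, and the rule that a fallback match must not overlap an already claimed span is an assumption.

```python
import html
from dataclasses import dataclass
from typing import Optional

@dataclass
class Extraction:  # hypothetical stand-in, not langextract's own class
    text: str
    char_interval: Optional[tuple[int, int]] = None
    status: Optional[str] = None

def substring_fallback(source: str, extractions: list[Extraction]) -> None:
    """Give unaligned extractions a char_interval via exact substring search."""
    occupied = [e.char_interval for e in extractions if e.char_interval is not None]
    for ext in extractions:
        if ext.char_interval is not None or not ext.text:
            continue
        start = source.find(ext.text)
        while start != -1:
            end = start + len(ext.text)
            # Accept the first occurrence that overlaps no claimed span (assumed rule)
            if all(end <= s or start >= e for s, e in occupied):
                ext.char_interval = (start, end)
                ext.status = "MATCH_SUBSTRING"
                occupied.append((start, end))
                break
            start = source.find(ext.text, start + 1)

def unaligned_panel(extractions: list[Extraction]) -> str:
    """Render extractions that still have no char_interval, HTML-escaped."""
    items = "".join(
        f"<div>{html.escape(e.text)}</div>"
        for e in extractions if e.char_interval is None
    )
    return f'<div class="unaligned-extractions-panel">{items}</div>' if items else ""

source = "患者每天服用阿司匹林，阿司匹林可以止痛。"
items = [Extraction("阿司匹林"), Extraction("阿司匹林"), Extraction("布洛芬")]
substring_fallback(source, items)
for item in items:
    print(item.text, item.char_interval, item.status)
print(unaligned_panel(items))
```

In this example the two identical extractions land on the two separate occurrences in the source, while the third, which never appears in the text, stays unaligned and is the only entry in the panel.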