Reject Unicode whitespace characters in SQL identifiers #38
Draft
NikolayS wants to merge 1 commit into
The flex scanner's identifier rules use byte ranges (\200-\377) to match non-ASCII characters, but this range also matches bytes that form multi-byte Unicode whitespace characters such as NO-BREAK SPACE (U+00A0). This creates a "Trojan Source" vulnerability where queries that appear normal under visual inspection parse with entirely different semantics.

For example, a NO-BREAK SPACE between "is" and "null" causes the scanner to produce a single identifier token "is<NBSP>null" instead of the keywords IS NULL, which can silently turn a NULL check into a column alias, leaking sensitive data.

Add check_ident_for_unicode_whitespace(), which rejects any unquoted identifier containing a character with the Unicode White_Space property. The check uses PostgreSQL's existing unicode_category infrastructure and only applies to UTF-8 encoded databases. All 18 Unicode whitespace characters (U+00A0, U+1680, U+2000-U+200A, U+2028, U+2029, U+202F, U+205F, U+3000) are covered.

Reported-by: Maxim Boguk

https://claude.ai/code/session_01LKw2qMLrvkpxSKHFQ6eN51
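The shape of such a check can be sketched in standalone C. This is not the patch's code: the real function is described as using PostgreSQL's `unicode_category` infrastructure, while this sketch hard-codes the 18 code points listed above. The names `is_listed_unicode_space`, `decode_utf8`, and `ident_has_unicode_space` are hypothetical, and the decoder is deliberately simplified (it does not reject overlong encodings). Note that U+0085 (NEXT LINE) also carries the White_Space property but is not in the list above, so the sketch does not flag it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true for the 18 code points named in the description. */
bool
is_listed_unicode_space(uint32_t cp)
{
    return cp == 0x00A0 || cp == 0x1680 ||
           (cp >= 0x2000 && cp <= 0x200A) ||
           cp == 0x2028 || cp == 0x2029 ||
           cp == 0x202F || cp == 0x205F || cp == 0x3000;
}

/*
 * Hypothetical helper: decode one UTF-8 sequence at s (len bytes available).
 * Stores the code point in *cp and returns the sequence length in bytes,
 * or 0 if the sequence is truncated or malformed.
 */
size_t
decode_utf8(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t   n;
    uint32_t c;

    if (len == 0)
        return 0;
    if (s[0] < 0x80)
    {
        *cp = s[0];
        return 1;
    }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; c = s[0] & 0x07; }
    else
        return 0;               /* stray continuation or invalid lead byte */

    if (len < n)
        return 0;
    for (size_t i = 1; i < n; i++)
    {
        if ((s[i] & 0xC0) != 0x80)
            return 0;           /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return n;
}

/* True if the UTF-8 identifier contains any of the listed code points. */
bool
ident_has_unicode_space(const char *ident, size_t len)
{
    const unsigned char *p = (const unsigned char *) ident;
    size_t i = 0;

    while (i < len)
    {
        uint32_t cp;
        size_t   n = decode_utf8(p + i, len - i, &cp);

        if (n == 0)
            return false;       /* malformed input: left to encoding validation */
        if (is_listed_unicode_space(cp))
            return true;
        i += n;
    }
    return false;
}
```

In this sketch a caller would raise a syntax error when `ident_has_unicode_space()` returns true for an unquoted identifier. Ordinary non-ASCII letters such as `é` (U+00E9) pass through untouched.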
Summary
- Add `check_ident_for_unicode_whitespace()` in the scanner support code, which rejects any unquoted identifier containing a character with the Unicode `White_Space` property, using PostgreSQL's existing `unicode_category` infrastructure

The vulnerability
The flex scanner's `ident_start`/`ident_cont` rules use `\200-\377` byte ranges, which match all non-ASCII bytes, including bytes that form multi-byte Unicode whitespace. For example, a NO-BREAK SPACE (U+00A0) between `is` and `null` is invisible in terminals, editors, and GitHub. The scanner produces one identifier token `is<NBSP>null` instead of the keywords `IS NULL`, turning a NULL check into a column alias.

This also enables poisoned views.
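The byte-range matching described above can be reproduced outside flex with a minimal C sketch. It assumes the character classes are `ident_start = [A-Za-z\200-\377_]` and `ident_cont = [A-Za-z\200-\377_0-9\$]`. Only the `\200-\377` part is stated in this description, so the rest is an assumption about `scan.l`. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed ident_start class: ASCII letters, underscore, or any byte >= 0x80. */
bool
is_ident_start(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           c == '_' || c >= 0x80;
}

/* Assumed ident_cont class: ident_start plus digits and '$'. */
bool
is_ident_cont(unsigned char c)
{
    return is_ident_start(c) || (c >= '0' && c <= '9') || c == '$';
}

/*
 * Length in bytes of the identifier token at the start of s,
 * or 0 if s does not begin with an identifier.
 */
size_t
ident_token_len(const char *s)
{
    const unsigned char *p = (const unsigned char *) s;
    size_t n;

    if (!is_ident_start(p[0]))
        return 0;
    n = 1;
    while (p[n] != '\0' && is_ident_cont(p[n]))
        n++;
    return n;
}
```

With an ASCII space, `ident_token_len("is null")` stops after `is`. With NO-BREAK SPACE, encoded in UTF-8 as the bytes `0xC2 0xA0`, both bytes fall in the `\200-\377` range, so `ident_token_len("is\xC2\xA0null")` consumes the whole string as one identifier.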
Reported-by: Maxim Boguk
Changed files
- `src/backend/parser/scansup.c`: `check_ident_for_unicode_whitespace()` function
- `src/include/parser/scansup.h`
- `src/backend/parser/scan.l`: `{identifier}` and `{xufailed}` rules
- `src/test/regress/sql/unicode.sql`
- `src/test/regress/expected/unicode.out`

Test plan
- `IS NULL`, `IS TRUE`, etc. still work

https://claude.ai/code/session_01LKw2qMLrvkpxSKHFQ6eN51
Generated by Claude Code