
Reject Unicode whitespace characters in SQL identifiers #38

Draft
NikolayS wants to merge 1 commit into master from claude/busy-clarke-eEG0K


Summary

  • Fix a "Trojan Source" vulnerability where Unicode whitespace characters (e.g., NO-BREAK SPACE U+00A0) in unquoted identifiers create visually deceptive queries that parse with different semantics than what code review suggests
  • Add check_ident_for_unicode_whitespace() in the scanner support code that rejects any unquoted identifier containing a Unicode White_Space property character, using PostgreSQL's existing unicode_category infrastructure
  • Covers 18 non-ASCII Unicode whitespace characters: U+00A0, U+1680, U+2000–U+200A, U+2028, U+2029, U+202F, U+205F, U+3000
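
To make the shape of the check concrete, here is a minimal standalone sketch. It is not the code from this PR: the PR states that it relies on PostgreSQL's unicode_category infrastructure, whereas this sketch hard-codes the code points listed above and uses a simplified UTF-8 decoder. The names is_listed_unicode_space, decode_utf8 and ident_has_unicode_space are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True for the code points listed in the summary (hard-coded for illustration). */
static bool is_listed_unicode_space(uint32_t cp)
{
    return cp == 0x00A0 || cp == 0x1680 ||
           (cp >= 0x2000 && cp <= 0x200A) ||
           cp == 0x2028 || cp == 0x2029 || cp == 0x202F ||
           cp == 0x205F || cp == 0x3000;
}

/*
 * Decode one UTF-8 sequence starting at s, with len bytes available.
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the sequence is malformed or truncated. Simplified: overlong
 * encodings are not rejected.
 */
static size_t decode_utf8(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t need, i;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; need = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; need = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; need = 4; }
    else
        return 0;

    if (len < need)
        return 0;
    for (i = 1; i < need; i++)
    {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return need;
}

/* Returns true if the identifier contains any of the listed characters. */
bool ident_has_unicode_space(const char *ident, size_t len)
{
    const unsigned char *p = (const unsigned char *) ident;
    size_t pos = 0;

    while (pos < len)
    {
        uint32_t cp;
        size_t n = decode_utf8(p + pos, len - pos, &cp);

        if (n == 0)
        {
            pos++;                  /* assumption: skip a malformed byte */
            continue;
        }
        if (is_listed_unicode_space(cp))
            return true;
        pos += n;
    }
    return false;
}
```

In this sketch a caller would raise a syntax error when ident_has_unicode_space() returns true for an unquoted identifier. Identifiers made of ordinary non-ASCII letters, such as Cyrillic, pass unchanged.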

The vulnerability

The flex scanner's ident_start/ident_cont rules use \200-\377 byte ranges, which match all non-ASCII bytes — including bytes that form multi-byte Unicode whitespace. For example:

-- What you SEE in code review, logs, git diffs:
SELECT id, login, password IS NULL FROM users;

-- What PostgreSQL PARSES (NBSP between "is" and "null"):
-- password AS "is null"  →  leaks the actual password value

The NO-BREAK SPACE (U+00A0) between is and null is invisible in terminals, editors, and GitHub. The scanner produces one identifier token is<NBSP>null instead of keywords IS NULL, turning a NULL check into a column alias.
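
The byte-level behaviour described above can be reproduced outside flex with a small sketch. It assumes the scanner's character classes are ident_start [A-Za-z\200-\377_] and ident_cont [A-Za-z\200-\377_0-9\$], which is how I understand the definitions in scan.l. The function names are hypothetical and only model the identifier rule, not the real scanner.

```c
#include <stdbool.h>
#include <stddef.h>

/* Models ident_start: ASCII letters, underscore, or any byte >= 0x80. */
static bool is_ident_start(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           c == '_' || c >= 0x80;
}

/* Models ident_cont: ident_start plus digits and '$'. */
static bool is_ident_cont(unsigned char c)
{
    return is_ident_start(c) || (c >= '0' && c <= '9') || c == '$';
}

/*
 * Length in bytes of the identifier token starting at s, or 0 if s does
 * not start an identifier. The match is byte-wise, as in the flex rule,
 * so it never looks at which character a multi-byte sequence encodes.
 */
static size_t ident_token_len(const char *s)
{
    const unsigned char *p = (const unsigned char *) s;
    size_t n;

    if (!is_ident_start(p[0]))
        return 0;
    n = 1;
    while (p[n] != '\0' && is_ident_cont(p[n]))
        n++;
    return n;
}
```

With an ASCII space, "is null" yields a 2-byte token ("is") and the scanner then sees NULL as a separate keyword. With NBSP encoded in UTF-8 as the bytes 0xC2 0xA0, both bytes fall inside \200-\377, so the same function consumes all 8 bytes of is<NBSP>null as one identifier.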

This also enables poisoned views:

CREATE VIEW safe_users AS
  SELECT id, login FROM users WHERE superuser IS TRUE;
  -- With NBSP between IS and TRUE: returns ALL users

Reported-by: Maxim Boguk

Changed files

| File | Change |
|---|---|
| src/backend/parser/scansup.c | New check_ident_for_unicode_whitespace() function |
| src/include/parser/scansup.h | Declaration for the new function |
| src/backend/parser/scan.l | Call the check in {identifier} and {xufailed} rules |
| src/test/regress/sql/unicode.sql | Regression tests |
| src/test/regress/expected/unicode.out | Expected test output |

Test plan

  • Normal SQL queries with IS NULL, IS TRUE etc. still work
  • Legitimate non-Latin identifiers (Cyrillic, CJK, accented Latin) still work
  • U+00A0 (NO-BREAK SPACE) in unquoted identifier → ERROR
  • U+2002 (EN SPACE) in unquoted identifier → ERROR
  • U+3000 (IDEOGRAPHIC SPACE) in unquoted identifier → ERROR
  • U+1680 (OGHAM SPACE MARK) in unquoted identifier → ERROR
  • Unicode whitespace in string data (not identifiers) still works
  • All 228 regression tests pass

https://claude.ai/code/session_01LKw2qMLrvkpxSKHFQ6eN51


Generated by Claude Code

The flex scanner's identifier rules use byte ranges (\200-\377) to match
non-ASCII characters, but this range also matches bytes that form
multi-byte Unicode whitespace characters such as NO-BREAK SPACE (U+00A0).

This creates a "Trojan Source" vulnerability where queries that appear
normal under visual inspection parse with entirely different semantics.
For example, a NO-BREAK SPACE between "is" and "null" causes the scanner
to produce a single identifier token "is<NBSP>null" instead of the
keywords IS NULL, which can silently turn a NULL check into a column
alias, leaking sensitive data.

Add check_ident_for_unicode_whitespace() which rejects any unquoted
identifier containing a character with the Unicode White_Space property.
The check uses PostgreSQL's existing unicode_category infrastructure and
only applies to UTF-8 encoded databases. All 18 Unicode whitespace
characters (U+00A0, U+1680, U+2000-U+200A, U+2028, U+2029, U+202F,
U+205F, U+3000) are covered.

Reported-by: Maxim Boguk

https://claude.ai/code/session_01LKw2qMLrvkpxSKHFQ6eN51