Reject Unicode whitespace characters in SQL identifiers by NikolayS · Pull Request #38 · NikolayS/postgres

NikolayS · 2026-05-26T02:32:18Z

Summary

- Fix a "Trojan Source" vulnerability where Unicode whitespace characters (e.g., NO-BREAK SPACE U+00A0) in unquoted identifiers create visually deceptive queries that parse with different semantics than code review suggests.
- Add check_ident_for_unicode_whitespace() in the scanner support code. It rejects any unquoted identifier containing a Unicode White_Space property character, using PostgreSQL's existing unicode_category infrastructure.
- Cover 18 non-ASCII Unicode whitespace characters: U+00A0, U+1680, U+2000–U+200A, U+2028, U+2029, U+202F, U+205F, U+3000.
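The check described above can be sketched as a standalone C function. This is only an illustration under stated assumptions, not the code in this PR: the names `is_listed_unicode_space` and `ident_has_unicode_space` are hypothetical, the code point list is hard-coded from the summary rather than taken from PostgreSQL's unicode_category tables, and the input is assumed to be UTF-8.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True for the 18 non-ASCII code points listed in the PR summary. */
static bool
is_listed_unicode_space(uint32_t cp)
{
    return cp == 0x00A0 || cp == 0x1680 ||
           (cp >= 0x2000 && cp <= 0x200A) ||
           cp == 0x2028 || cp == 0x2029 ||
           cp == 0x202F || cp == 0x205F || cp == 0x3000;
}

/*
 * Hypothetical helper: walk a UTF-8 byte string of length len and report
 * whether any decoded code point is one of the listed whitespace characters.
 * Stray continuation bytes are skipped; a truncated sequence ends the scan.
 */
static bool
ident_has_unicode_space(const char *ident, size_t len)
{
    const unsigned char *p = (const unsigned char *) ident;
    size_t      i = 0;

    while (i < len)
    {
        uint32_t    cp;
        size_t      n;

        if (p[i] < 0x80)                 { cp = p[i];        n = 1; }
        else if ((p[i] & 0xE0) == 0xC0)  { cp = p[i] & 0x1F; n = 2; }
        else if ((p[i] & 0xF0) == 0xE0)  { cp = p[i] & 0x0F; n = 3; }
        else if ((p[i] & 0xF8) == 0xF0)  { cp = p[i] & 0x07; n = 4; }
        else                             { i++; continue; }

        if (i + n > len)
            return false;       /* truncated sequence: nothing more to decode */

        /* Fold in the 6 payload bits of each continuation byte. */
        for (size_t k = 1; k < n; k++)
            cp = (cp << 6) | (p[i + k] & 0x3F);

        if (is_listed_unicode_space(cp))
            return true;
        i += n;
    }
    return false;
}
```

A caller would pass the raw identifier bytes as matched by the scanner, for example `ident_has_unicode_space(text, strlen(text))`, and raise an error when the result is true.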

The vulnerability

The flex scanner's ident_start/ident_cont rules use \200-\377 byte ranges, which match all non-ASCII bytes — including bytes that form multi-byte Unicode whitespace. For example:

-- What you SEE in code review, logs, git diffs:
SELECT id, login, password IS NULL FROM users;

-- What PostgreSQL PARSES (NBSP between "is" and "null"):
-- password AS "is<NBSP>null"  →  leaks the actual password value

The NO-BREAK SPACE (U+00A0) between is and null renders identically to an ordinary space in terminals, editors, and GitHub. The scanner produces one identifier token is<NBSP>null instead of the keywords IS NULL, turning a NULL check into a column alias.
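To see why the scanner swallows the NBSP, note that U+00A0 is encoded in UTF-8 as the two bytes 0xC2 0xA0, and both fall inside the \200-\377 range. The sketch below models the identifier byte classes in plain C. The function names are hypothetical, and the ASCII portion of each class (letters, underscore, digits, dollar sign) is an assumption about scan.l that this page does not spell out; only the \200-\377 range comes from the description above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Models ident_start, assumed to be [A-Za-z\200-\377_]. */
static bool
is_ident_start_byte(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           c >= 0x80 || c == '_';
}

/* Models ident_cont, assumed to be [A-Za-z\200-\377_0-9\$]. */
static bool
is_ident_cont_byte(unsigned char c)
{
    return is_ident_start_byte(c) || (c >= '0' && c <= '9') || c == '$';
}

/*
 * Length in bytes of the identifier token at the start of s,
 * or 0 if s does not begin with an identifier.
 */
static size_t
ident_token_len(const char *s)
{
    const unsigned char *p = (const unsigned char *) s;
    size_t      n;

    if (!is_ident_start_byte(p[0]))
        return 0;

    n = 1;
    while (p[n] != '\0' && is_ident_cont_byte(p[n]))
        n++;
    return n;
}
```

With an ordinary space, `ident_token_len("is null")` stops after the two bytes of `is`. With an NBSP in the same position, the whole 8-byte sequence `is`, 0xC2, 0xA0, `null` is consumed as one token, which is the behavior the PR rejects.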

This also enables poisoned views:

CREATE VIEW safe_users AS
  SELECT id, login FROM users WHERE superuser IS TRUE;
  -- With NBSP between IS and TRUE: returns ALL users

Reported-by: Maxim Boguk

Changed files

| File | Change |
| --- | --- |
| `src/backend/parser/scansup.c` | New `check_ident_for_unicode_whitespace()` function |
| `src/include/parser/scansup.h` | Declaration for the new function |
| `src/backend/parser/scan.l` | Call the check in `{identifier}` and `{xufailed}` rules |
| `src/test/regress/sql/unicode.sql` | Regression tests |
| `src/test/regress/expected/unicode.out` | Expected test output |

Test plan

- Normal SQL queries with IS NULL, IS TRUE etc. still work
- Legitimate non-Latin identifiers (Cyrillic, CJK, accented Latin) still work
- U+00A0 (NO-BREAK SPACE) in unquoted identifier → ERROR
- U+2002 (EN SPACE) in unquoted identifier → ERROR
- U+3000 (IDEOGRAPHIC SPACE) in unquoted identifier → ERROR
- U+1680 (OGHAM SPACE MARK) in unquoted identifier → ERROR
- Unicode whitespace in string data (not identifiers) still works
- All 228 regression tests pass

https://claude.ai/code/session_01LKw2qMLrvkpxSKHFQ6eN51

Generated by Claude Code

The flex scanner's identifier rules use byte ranges (\200-\377) to match non-ASCII characters, but this range also matches the bytes of multi-byte Unicode whitespace characters such as NO-BREAK SPACE (U+00A0). This creates a "Trojan Source" vulnerability where queries that appear normal under visual inspection parse with entirely different semantics. For example, a NO-BREAK SPACE between "is" and "null" causes the scanner to produce a single identifier token "is<NBSP>null" instead of the keywords IS NULL, which can silently turn a NULL check into a column alias, leaking sensitive data.

Add check_ident_for_unicode_whitespace(), which rejects any unquoted identifier containing a character with the Unicode White_Space property. The check uses PostgreSQL's existing unicode_category infrastructure and only applies to UTF-8 encoded databases. The 18 non-ASCII Unicode whitespace characters (U+00A0, U+1680, U+2000-U+200A, U+2028, U+2029, U+202F, U+205F, U+3000) are covered.

Reported-by: Maxim Boguk

https://claude.ai/code/session_01LKw2qMLrvkpxSKHFQ6eN51

NikolayS wants to merge 1 commit into master from claude/busy-clarke-eEG0K