Possible out-of-bounds read / crash in C++ runtime `ATNDeserializer` (Serialized ATN parsing) #4908

siewer · 2025-12-03T18:26:27Z

Hi maintainers 👋
I couldn’t find a dedicated security contact (e.g., SECURITY.md or a private reporting address) in this repository, so I’m posting this as a public GitHub Discussion.

Tone/intent: I’m trying to be careful with details because this looks like a memory-safety issue in the C++ runtime. I’m sharing enough to enable triage and reproduce safely, while avoiding “drop-in” exploitation guidance. If you prefer a private channel, please point me to it and I’ll immediately move the full technical PoC/inputs there.

Summary (what I think is happening)

In the C++ runtime, ATNDeserializer::deserialize(...) appears to read from a SerializedATNView using an index p (e.g., data[p++]) without consistently verifying that p remains within bounds of the backing buffer.

The deserialization logic is driven by counts embedded in the serialized data itself (e.g., nstates, nedges). If an attacker can influence the serialized ATN payload (even in a non-standard integration), they can set these counts such that the deserializer attempts to read beyond the end of the provided buffer, leading to:

- Denial of Service (segmentation fault / crash)
- Potential information disclosure via out-of-bounds read (less likely than crash, but conceptually possible in C++ OOB read scenarios)

This matches the general pattern of CWE-125: Out-of-bounds Read.

Affected area (where to look)

File: ATNDeserializer.cpp (C++ runtime)
Function: ATNDeserializer::deserialize(SerializedATNView data) const

The key pattern is:

- `size_t nstates = data[p++];` (reads an untrusted count)
- followed by loops like `for (size_t i = 0; i < nstates; i++) { ... data[p++] ... }`
- where `p` increments repeatedly and reads are performed without strong bounds checks tied to `data.size()`.

(If my file path or exact lines differ due to branch/version, the above describes the logic pattern I observed.)

Why this matters (threat model)

In “classic” ANTLR usage, serialized ATN is generated by the tool and embedded in generated sources, so the payload is usually trusted.

However, real-world applications sometimes:

- load grammars / generated artifacts as plugins,
- accept generated parsers from external sources,
- expose parsing components in multi-tenant environments,
- or otherwise end up deserializing serialized ATN that is not fully trusted.

In those integrations, an unsafe deserializer becomes a reliability and security risk.

Even if you consider “attacker controls serialized ATN” to be out-of-scope, I still think hardening is warranted because the fix is localized and improves robustness (and it may help fuzzing / defensive posture overall).

Reproduction (high-level, safe)

This is the minimal idea to reproduce a crash (details can be provided privately):

1. Construct a serialized ATN byte/int sequence where an early field like nstates is set to a very large value.
2. Provide a backing buffer that is far smaller than what would be required for that many states/edges/transitions.
3. Call ATNDeserializer::deserialize(...) with this SerializedATNView.
4. Observe that the deserializer continues reading data[p++] until it crosses the buffer boundary, eventually causing an invalid read and crash.
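
The steps above can be modeled safely without touching the ANTLR runtime. The sketch below is a toy stand-in, not the real serialization format or API: `ReadOutcome`, `simulateStateLoop`, and the two-fields-per-state layout are hypothetical and exist only for illustration. It uses `std::vector::at()` so that the boundary crossing surfaces as an exception rather than as an actual invalid read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Toy model only: this is NOT the ANTLR serialization format or API.
// It mimics "a count read from the data drives a loop of data[p++] reads".
struct ReadOutcome {
  bool inBounds;            // true if every read stayed inside the buffer
  std::size_t failedIndex;  // index of the first rejected read (if any)
};

ReadOutcome simulateStateLoop(const std::vector<std::int32_t>& data) {
  std::size_t p = 0;
  try {
    // vector::at() throws std::out_of_range instead of reading past the end.
    const auto nstates = static_cast<std::size_t>(data.at(p++));
    for (std::size_t i = 0; i < nstates; i++) {
      (void)data.at(p++);  // hypothetical "state type" field
      (void)data.at(p++);  // hypothetical "rule index" field
    }
  } catch (const std::out_of_range&) {
    // p++ was evaluated before at() threw, so the failing index is p - 1.
    assert(p >= 1);
    return {false, p - 1};
  }
  return {true, 0};
}
```

With the buffer `{1000000, 1, 0}`, the model reports index 3 as the first read that would leave the buffer; an unchecked `data[p++]` at that point is where the real out-of-bounds read would begin.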

Suggested remediation / hardening options

Any of the following would address the core issue:

1. Centralized bounds checking in SerializedATNView::operator[]
   - Ensure any indexed read checks index < size (and fails fast with a controlled exception/error).
2. Explicit checks in ATNDeserializer before each read
   - e.g., guard reads with something like if (p >= data.size()) throw ...;
   - and/or create helper methods readInt() / readShort() that track p and validate bounds.
3. Add “remaining length” validation for sections driven by counts
   - For example, before reading a section of nstates elements, validate that enough buffer remains for at least the minimal representation of each item (even if formats differ).

The goal would be to turn “crash on malformed input” into a controlled parse failure.
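
A minimal sketch of the helper-method option, assuming the serialized data is exposed as a contiguous array of 32-bit words (an assumption here; the actual element type of SerializedATNView should be checked against the runtime source). `BoundedReader` is a hypothetical class, not part of the ANTLR runtime, and `readInt()` simply echoes the helper name suggested above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of the ANTLR runtime): a cursor over a
// read-only buffer of 32-bit words that validates every read.
class BoundedReader {
 public:
  BoundedReader(const std::int32_t* data, std::size_t size)
      : data_(data), size_(size) {}

  // Number of words not yet consumed.
  std::size_t remaining() const {
    assert(pos_ <= size_);  // invariant: the cursor never passes the end
    return size_ - pos_;
  }

  // Returns the next word, or throws instead of reading out of bounds.
  std::int32_t readInt() {
    if (pos_ >= size_) {
      throw std::out_of_range("serialized data truncated at index " +
                              std::to_string(pos_));
    }
    return data_[pos_++];
  }

 private:
  const std::int32_t* data_;
  std::size_t size_;
  std::size_t pos_ = 0;
};
```

Routing every `data[p++]` through a single method like this keeps the bounds check in one place, so a truncated or inflated payload ends in a catchable `std::out_of_range` rather than an invalid read.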

Requested next steps

Please confirm whether you consider this:
- a security vulnerability to be tracked (CVE / advisory),
- a robustness bug to be fixed without a security process, or
- not a problem after validation.

If there is a preferred private reporting path, please share it and I’ll provide:
- the full PoC input,
- the exact commit/version tested.

Thanks for taking a look.

ericvergnaud (Maintainer) · 2025-12-04T13:57:29Z

Hey, thanks for this. If I understand correctly, there is a risk that if the serialized ATN is malformed, then the deserializer will crash?
Indeed.
As you rightly observe, this can only happen if the serialized data has been tampered with.
You write:

> If an attacker can influence the serialized ATN payload

Can you provide an example of how they would do that? Not saying they can't, but if they can, then they can also do all sorts of very mean things, no? How would protecting the running process against data injection in this specific area prevent other ways of breaking the running process? (Genuinely trying to understand.)

siewer (Author) · 2025-12-04T14:43:26Z

Thanks for the quick reply - yes, that’s exactly the risk: if the serialized ATN is malformed (e.g., truncated or with inflated counts), the C++ deserializer can read past the end of the provided buffer and crash.

I fully agree with your point about the “classic” ANTLR workflow: normally the serialized ATN is generated at build time and embedded in the generated sources, so it is effectively trusted. Where this becomes more than a theoretical concern is when the runtime ends up deserializing ATN data that is not strictly controlled by the application author. A few realistic patterns I’ve seen:

- Plugin / extension ecosystems: host applications that load language support packs / generated parsers as plugins or add-ons. “Tampering” here can be as simple as installing a buggy/malicious plugin artifact (or supply-chain accident). A crash becomes a clean DoS of the host.
- Multi-tenant or service setups: “parsing as a service” / IDE backends / analysis pipelines that process external grammar artifacts. Even if sandboxing is the main control, a safe failure mode (reject data) is still valuable defense-in-depth.
- Corruption/truncation: not necessarily malicious - partial writes, packaging mistakes, or artifact corruption can lead to malformed serialized data. Bounds checks turn this from a hard crash into a controlled error path, improving robustness.

So I’m not claiming this prevents a fully-powerful attacker who can arbitrarily modify process memory or binaries. The value is mainly: if an attacker (or a faulty artifact) can influence a narrow input surface that reaches ATNDeserializer, this avoids “easy crash” and enforces “fail closed”.

Below is the key excerpt that demonstrates the core bug pattern (unchecked count → loop → OOB read), matching the structure in ATNDeserializer:

// Attacker-controlled count drives reads with no bounds checks.
size_t nstates = data[p++];

for (size_t i = 0; i < nstates; i++) {
  int stype     = data[p++]; // OOB read when buffer is too small
  int ruleIndex = data[p++]; // further OOB reads
}

And the smallest harness idea is: provide only a few ints (header + nstates), but set nstates huge so the loop forces p to run past the end of the buffer and crash.
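
For comparison, here is a hedged sketch of the same loop with an up-front remaining-length check. `readStatesChecked` and `kMinWordsPerState` are hypothetical names, the `std::vector<std::int32_t>` buffer is a stand-in for SerializedATNView, and the two-words-per-state layout is taken from the excerpt above rather than from the real serialization format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Assumption from the excerpt: each state consumes at least two words.
constexpr std::size_t kMinWordsPerState = 2;

// Reads the state count at index p, consumes the state entries, and returns
// the new cursor. Throws before the loop if the count cannot possibly fit.
std::size_t readStatesChecked(const std::vector<std::int32_t>& data,
                              std::size_t p) {
  if (p >= data.size()) throw std::out_of_range("missing state count");
  // A negative word converts to a very large size_t and is rejected below.
  const auto nstates = static_cast<std::size_t>(data[p++]);

  assert(p <= data.size());
  const std::size_t remaining = data.size() - p;
  // Divide rather than multiply so a huge count cannot overflow the check.
  if (nstates > remaining / kMinWordsPerState) {
    throw std::out_of_range("state count exceeds remaining data");
  }

  for (std::size_t i = 0; i < nstates; i++) {
    const std::int32_t stype = data[p++];      // in bounds by the check above
    const std::int32_t ruleIndex = data[p++];  // in bounds by the check above
    (void)stype;
    (void)ruleIndex;
  }
  return p;
}
```

Given the harness input described above (a huge nstates followed by only a couple of words), this version throws before the first loop iteration instead of walking off the end of the buffer.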

If it helps your triage, I can share a tiny standalone reproducer plus an AddressSanitizer trace that clearly shows the out-of-bounds read.
