Fix JSON deserialization of empty containers and string-embedded delimiters#845
Conversation
parse_array and parse_object now track quoted-string state (with
backslash escapes) while scanning values, accept empty [] and {}
containers, and reject trailing commas. A shared
scan_balanced_expression helper replaces the three raw brace/bracket
counting loops, so delimiters embedded in nested string values no
longer split a container in the wrong place. Adds 20 Json gtest cases.
tigercosmos
left a comment
There was a problem hiding this comment.
@yungyuc Please take a look, thanks!
| // (with backslash escapes) are ignored when matching, so brackets or braces | ||
| // embedded in string values do not unbalance the scan. Throws when the | ||
| // container is not closed. | ||
| inline void scan_balanced_expression(const std::string & json, size_t & index, char open_char, char close_char, int & line, int & column, std::string & out) |
There was a problem hiding this comment.
Extract the reused logic.
| // Track quoted-string state so that ',' or ']' inside a | ||
| // string value does not terminate the scan early. |
| // Track quoted-string state so that ',' or '}' inside a | ||
| // string value does not terminate the scan early. |
| // sequences. They exercise the parser directly through detail::JsonNode so | ||
| // that the produced structure can be inspected precisely. | ||
|
|
||
| TEST(Json, parse_empty_array) |
There was a problem hiding this comment.
More test cases to increae the coverage.
There was a problem hiding this comment.
I'd like the tests to be moved to Python, which requires the Python wrapper for the serialization code. Please help create an issue to track the work (it's future work that does not need to be included in this PR).
There was a problem hiding this comment.
@yungyuc Sure, let me move all serialization test to python in the next PR.
yungyuc
left a comment
There was a problem hiding this comment.
The fix looks good to me.
A derived issue: The JSON processing (serialization) code should live in C++ for performance and flexibility, but the tests should be in Python for ease of maintenance and natural documentation. It should be fixed in the future.
| // sequences. They exercise the parser directly through detail::JsonNode so | ||
| // that the produced structure can be inspected precisely. | ||
|
|
||
| TEST(Json, parse_empty_array) |
There was a problem hiding this comment.
I'd like the tests to be moved to Python, which requires the Python wrapper for the serialization code. Please help create an issue to track the work (it's future work that does not need to be included in this PR).
The hand-rolled JSON deserializer in SerializableItem.cpp could not parse several valid inputs and silently accepted some invalid ones. Empty arrays
[]and empty objects{}were rejected outright, and any primitive string value embedding a structural delimiter (,,],}) was truncated at the first delimiter because the value scanner did not track quoted strings. Both parse_array and parse_object now scan leaf values with quoted-string and backslash-escape awareness, and accept empty containers.The same string-blindness affected nested containers: the loops that extracted a nested
{...}or[...]counted raw braces and brackets, so a string value such as"a]b"inside a nested array still split the container at the wrong place. Those three loops are replaced by a single shared scan_balanced_expression helper that ignores string contents (and their escapes) when matching the closing delimiter, so nested structures like{"k":["a]b"]}parse correctly.Trailing commas are now rejected rather than silently tolerated: an empty container stays valid, but a closing delimiter immediately following a comma raises a parse error, keeping the parser aligned with strict JSON.
The change adds twenty gtest cases under the Json suite covering empty and whitespace-only containers, leaf and nested string values that embed each delimiter, escaped quotes and backslashes through both the leaf and container scanners, trailing/leading/double-comma rejection, unterminated nested containers, and an end-to-end round trip. A few pre-existing weaknesses are intentionally deferred: the End state still accepts a single trailing character after the closing bracket, malformed end-of-input can leave an empty value expression before classification, and number classification remains a character-class check.