[Refactor] Basic Query Parser #90
base: main
Changes from 2 commits
```diff
@@ -1,23 +1,354 @@
-from core.ast.node import QueryNode
+from core.ast.node import (
+    Node, QueryNode, SelectNode, FromNode, WhereNode, TableNode, ColumnNode,
+    LiteralNode, OperatorNode, FunctionNode, GroupByNode, HavingNode,
+    OrderByNode, OrderByItemNode, LimitNode, OffsetNode, SubqueryNode, VarNode, VarSetNode, JoinNode
+)
+from core.ast.enums import JoinType, SortOrder
+import mo_sql_parsing as mosql
```
```python
class QueryParser:
    @staticmethod
    def normalize_to_list(value):
        """Normalize mo_sql_parsing output to a list format.

        mo_sql_parsing returns:
        - list when multiple items
        - dict when single item with structure
        - str when single simple value

        This normalizes all cases to a list.
        """
        if value is None:
            return []
        elif isinstance(value, list):
            return value
        elif isinstance(value, (dict, str)):
            return [value]
        else:
            return [value]
```
Suggested change:

```diff
-            return [value]
+            raise TypeError(
+                f"normalize_to_list: Unexpected type {type(value).__name__} for value {value!r}. "
+                "Expected None, list, dict, or str."
+            )
```
This is a good suggestion. @HazelYuAhiru, let's adopt this suggestion.
Copilot (AI) · Dec 4, 2025
The self.aliases instance variable is initialized within the parse() method rather than in __init__. This means the aliases dictionary persists between parse calls and could lead to stale alias references affecting subsequent parses. Consider either: 1) initializing self.aliases = {} in a proper __init__ method, or 2) passing aliases as a parameter through the helper methods instead of using instance state.
This is a good point, which I also noticed. If we use an instance member to store the aliases, the instance is not thread-safe. There are two ways to solve this: [1] pass the aliases down through the subsequent function calls as a helper argument, similar to what we use as the memo in the rewriter engine; or [2] declare QueryParser as non-thread-safe and create a new parser instance everywhere we need to call parse, i.e., parser = QueryParser(). If we take approach [2], it makes more sense to initialize self.aliases = {} in __init__ instead of in parse(). @HazelYuAhiru, please make a decision in coordination with @colinthebomb1's PR: if he uses the same approach (internal state shared by different functions as instance-level state), we can use [2]; if he already used [1], you may switch to [1]. If you decide on [2], please find a proper annotation in Python to mark the class as non-thread-safe.
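A minimal sketch of option [1], threading the aliases dict through the call chain instead of keeping it on the instance. The function names and dict shapes below are simplified stand-ins, not the real QueryParser methods:

```python
# Option [1] sketch: each parse() call owns its own aliases dict, so no
# state is shared across calls or threads. parse_select/parse_having are
# hypothetical simplified stand-ins for the real parser methods.

def parse(tree: dict) -> dict:
    aliases: dict = {}  # local to this call; never stored on the instance
    select = parse_select(tree.get("select"), aliases)
    having = parse_having(tree.get("having"), aliases)
    return {"select": select, "having": having}

def parse_select(select_dict, aliases: dict):
    # record alias -> expression mappings as a side effect of parsing SELECT
    if isinstance(select_dict, dict) and "name" in select_dict:
        aliases[select_dict["name"]] = select_dict["value"]
    return select_dict

def parse_having(having_dict, aliases: dict):
    # resolve a bare alias reference back to the aliased expression
    if isinstance(having_dict, str) and having_dict in aliases:
        return aliases[having_dict]
    return having_dict
```

Because the aliases dict is created fresh inside parse() and only travels through arguments, two concurrent parses on the same parser can no longer see each other's aliases.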
baiqiushi marked this conversation as resolved.
Copilot (AI) · Dec 4, 2025
The parse_from method raises a generic ValueError with message "JOIN found without a left table" without providing context about which join or what the input was. Consider including more diagnostic information in the error message, such as the join_key or item being processed, to aid debugging.
Suggested change:

```diff
-raise ValueError("JOIN found without a left table")
+raise ValueError(f"JOIN found without a left table. join_key={join_key}, item={item}")
```
This is a good suggestion. Let's adopt it.
Copilot (AI) · Dec 4, 2025
The JoinNode constructor expects _left_table: 'TableNode' as its first parameter (see core/ast/node.py:163), but left_source could be a JoinNode from a previous iteration (as indicated by the comment on line 132 and assignment on line 133). This will cause a type error when chaining multiple joins. The JoinNode signature should be updated to accept Node types, or the parsing logic should be restructured to handle join chains differently.
This is a good point. Should we change the constructor of JoinNode to allow either JoinNode or TableNode as _left_table?
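A sketch of that widening, using Union in the annotation so a chained join ((a JOIN b) JOIN c) type-checks. The class shapes here are simplified stand-ins for core/ast/node.py, kept only to show the annotation change:

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class TableNode:
    name: str

@dataclass(frozen=True)
class JoinNode:
    # was: _left_table: 'TableNode' -- too narrow once joins are chained
    left: Union["TableNode", "JoinNode"]
    right: "TableNode"

# chaining: the left side of the outer join is itself a JoinNode
ab = JoinNode(TableNode("a"), TableNode("b"))
abc = JoinNode(ab, TableNode("c"))
```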
Copilot (AI) · Dec 4, 2025
FromNode constructor expects a list but receives a set here. Throughout the parse_from method, sources is maintained as a set (initialized on line 97, with add() operations on lines 148, 155, 159), but the FromNode.__init__ signature expects _sources: List['Node'] (see core/ast/node.py:195). This type inconsistency needs to be fixed by making sources a list throughout this method.
Correct. Let's fix it.
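A small sketch of the fix (parse_from here is a simplified stand-in): a list matches the declared List['Node'] parameter and preserves source order, while an explicit membership check keeps the deduplication the set provided:

```python
def parse_from(items):
    sources = []                  # list, not set(): FromNode expects List['Node']
    for item in items:
        if item not in sources:   # preserve the dedup behavior the set() gave us
            sources.append(item)  # append(), not add()
    return sources
```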
Copilot (AI) · Dec 4, 2025
WhereNode constructor expects a list but receives a set here. The predicates variable is initialized as a set (line 164) and set operations are used (line 165), but the WhereNode.__init__ signature expects _predicates: List['Node'] (see core/ast/node.py:201). Change to use a list instead of a set.
The same as above.
Copilot (AI) · Dec 4, 2025
HavingNode constructor expects a list but receives a set here. The predicates variable is initialized as a set (line 185) and set operations are used (line 190), but the HavingNode.__init__ signature expects _predicates: List['Node'] (see core/ast/node.py:213). Change to use a list instead of a set.
Suggested change:

```diff
-predicates = set()
-expr = self.parse_expression(having_dict)
-# Check if this expression references an aliased function from SELECT
-expr = self.resolve_aliases(expr)
-predicates.add(expr)
+predicates = []
+expr = self.parse_expression(having_dict)
+# Check if this expression references an aliased function from SELECT
+expr = self.resolve_aliases(expr)
+predicates.append(expr)
```
The same
Copilot (AI) · Dec 4, 2025
When the OperatorNode is created with unary operators (e.g., 'NOT'), only one operand is passed (line 309). The resolve_aliases method assumes binary operators and always accesses children[0] and children[1] (lines 227-228). This will cause an IndexError when resolving aliases for unary operators. Add a check for the number of children before accessing indices.
Suggested change:

```diff
-left = self.resolve_aliases(expr.children[0])
-right = self.resolve_aliases(expr.children[1])
-return OperatorNode(left, expr.name, right)
+if len(expr.children) == 1:
+    child = self.resolve_aliases(expr.children[0])
+    return OperatorNode(child, expr.name)
+elif len(expr.children) == 2:
+    left = self.resolve_aliases(expr.children[0])
+    right = self.resolve_aliases(expr.children[1])
+    return OperatorNode(left, expr.name, right)
+else:
+    # Unexpected number of children; return as is
+    return expr
```
This is a good point, but I doubt we can resolve it in this PR since we don't have such test cases yet. Let's push this fix to later PRs where we introduce more test cases.
Copilot (AI) · Dec 4, 2025
[nitpick] The alias resolution in the HAVING clause iterates through all aliases to find matching FunctionNode or ColumnNode instances (lines 233-241, 246-252). For large SELECT clauses with many aliases, this could be inefficient. Consider optimizing by creating a separate lookup structure or only checking relevant aliases.
That is OK for now.
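If this does become a hot spot later, a name-keyed index would turn each HAVING lookup into a single dict access instead of a scan over every SELECT alias. ColumnNode and FunctionNode below are simplified stand-ins for the real AST nodes, shown only to illustrate the lookup structure:

```python
class ColumnNode:
    def __init__(self, name):
        self.name = name

class FunctionNode:
    def __init__(self, name, args):
        self.name = name
        self.args = args

def resolve_alias(node, alias_index: dict):
    """Swap a ColumnNode naming a SELECT alias for the aliased expression."""
    if isinstance(node, ColumnNode):
        # one dict lookup instead of iterating over all aliases
        return alias_index.get(node.name, node)
    return node

# alias index built once while parsing SELECT: alias name -> expression
alias_index = {"cnt": FunctionNode("count", [ColumnNode("*")])}
resolved = resolve_alias(ColumnNode("cnt"), alias_index)
```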
Copilot (AI) · Dec 4, 2025
The LiteralNode constructor expects _value: str|int|float|bool|datetime|None (see core/ast/node.py:96), but a tuple is being passed here when parsing list expressions. Tuples are not in the allowed types for LiteralNode values. This could cause type errors or unexpected behavior. Consider either updating LiteralNode to support tuple types or handling list literals differently.
Suggested change:

```diff
-# List literals (for IN clauses) - convert to tuple for hashability
-parsed = tuple(self.parse_expression(item) for item in expr)
-return LiteralNode(parsed)
+# List literals (for IN clauses)
+parsed = [self.parse_expression(item) for item in expr]
+return parsed
```
This is a valid point. Can we fix it?
Copilot (AI) · Dec 4, 2025
[nitpick] In the test's expected AST, count_star is created with _args as a keyword argument, but in the parser (line 314), FunctionNode is called with positional arguments. While this works since _args is the second positional parameter after _name, using keyword arguments consistently would improve code clarity and match the test's style. Consider changing line 314 to: return FunctionNode(op_name, _args=[ColumnNode('*')])
Suggested change:

```diff
-return FunctionNode(op_name, [ColumnNode('*')])
+return FunctionNode(op_name, _args=[ColumnNode('*')])
```
Please check. I am not sure.
Copilot (AI) · Dec 4, 2025
The json module is imported inline (line 321) rather than at the top of the file. Move this import to the top of the file with other imports to follow Python best practices (PEP 8).
Please fix it.
Import of 'SubqueryNode' is not used.
Import of 'VarNode' is not used.
Import of 'VarSetNode' is not used.
I also noticed that this version of the implementation does not handle subqueries. We will consider them in the next iteration.