Skip to content

Conversation

@HazelYuAhiru
Copy link
Collaborator

Overview:

This PR implements the query parser and adds tests for converting an input SQL query string into our custom AST structure (SQL → JSON via mosql → custom AST). The base test case passes, and all additional test cases run without errors.

Code Changes:

  • Added a base parsing test in tests/test_query_parser.py.
  • Implemented parse() and its helper functions.

colinthebomb1
colinthebomb1 previously approved these changes Dec 1, 2025
Copy link
Collaborator

@colinthebomb1 colinthebomb1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! let one small comment

count_star = FunctionNode("COUNT", _alias="emp_count", _args=[ColumnNode("*")])

# SELECT clause
select_clause = SelectNode({emp_name, dept_name, count_star})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uses lists [] instead of sets {} to make sure the order is maintained in clauses for formatting them back to SQL later

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! PR updated

@HazelYuAhiru HazelYuAhiru marked this pull request as ready for review December 3, 2025 22:17
@baiqiushi baiqiushi requested a review from Copilot December 4, 2025 00:02
Copilot finished reviewing on behalf of baiqiushi December 4, 2025 00:06
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements a query parser that converts SQL strings to a custom AST structure via the mosql library. It adds a basic parsing test and implements the core parse() method along with helper functions for parsing different SQL clauses.

Key Changes:

  • Implements QueryParser.parse() method that converts SQL to AST via mosql intermediate JSON
  • Adds helper methods for parsing SELECT, FROM, WHERE, GROUP BY, HAVING, and ORDER BY clauses
  • Implements expression parsing, alias resolution, and operator/join type normalization
  • Adds comprehensive test case test_basic_parse() that validates parsing of complex queries with JOINs, aggregations, and multiple clauses

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 20 comments.

File Description
tests/test_query_parser.py Adds test_basic_parse() function with comprehensive test case covering JOINs, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, and OFFSET; updates imports to include new node types
core/query_parser.py Implements complete query parser with parse() method and 10+ helper methods for clause parsing, expression parsing, alias resolution, and type normalization

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +224 to +255
def resolve_aliases(self, expr: Node) -> Node:
if isinstance(expr, OperatorNode):
# Recursively resolve aliases in operator operands
left = self.resolve_aliases(expr.children[0])
right = self.resolve_aliases(expr.children[1])
return OperatorNode(left, expr.name, right)
elif isinstance(expr, FunctionNode):
# Check if this function matches an aliased function from SELECT
if expr.alias is None:
for alias, aliased_expr in self.aliases.items():
if isinstance(aliased_expr, FunctionNode):
if (expr.name == aliased_expr.name and
len(expr.children) == len(aliased_expr.children) and
all(expr.children[i] == aliased_expr.children[i]
for i in range(len(expr.children)))):
# This function matches an aliased one, use the alias
expr.alias = alias
break
return expr
elif isinstance(expr, ColumnNode):
# Check if this column matches an aliased column from SELECT
if expr.alias is None:
for alias, aliased_expr in self.aliases.items():
if isinstance(aliased_expr, ColumnNode):
if (expr.name == aliased_expr.name and
expr.parent_alias == aliased_expr.parent_alias):
# This column matches an aliased one, use the alias
expr.alias = alias
break
return expr
else:
return expr
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The alias resolution in the HAVING clause iterates through all aliases to find matching FunctionNode or ColumnNode instances (lines 233-241, 246-252). For large SELECT clauses with many aliases, this could be inefficient. Consider optimizing by creating a separate lookup structure or only checking relevant aliases.

Copilot uses AI. Check for mistakes.
elif isinstance(value, (dict, str)):
return [value]
else:
return [value]
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The docstring states that the function normalizes to list format and handles dict, str, and other types, but the final else clause (lines 27-28) will wrap ANY other type (including unexpected types) into a list without validation. This could hide bugs where unexpected types are passed. Consider either being more explicit about what "other types" are expected, or raising an error for truly unexpected types.

Suggested change
return [value]
raise TypeError(
f"normalize_to_list: Unexpected type {type(value).__name__} for value {value!r}. "
"Expected None, list, dict, or str."
)

Copilot uses AI. Check for mistakes.
Comment on lines +185 to +190
predicates = set()
expr = self.parse_expression(having_dict)
# Check if this expression references an aliased function from SELECT
expr = self.resolve_aliases(expr)

predicates.add(expr)
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HavingNode constructor expects a list but receives a set here. The predicates variable is initialized as a set (line 185) and set operations are used (line 190), but HavingNode.init signature expects _predicates: List['Node'] (see core/ast/node.py:213). Change to use a list instead of a set.

Suggested change
predicates = set()
expr = self.parse_expression(having_dict)
# Check if this expression references an aliased function from SELECT
expr = self.resolve_aliases(expr)
predicates.add(expr)
predicates = []
expr = self.parse_expression(having_dict)
# Check if this expression references an aliased function from SELECT
expr = self.resolve_aliases(expr)
predicates.append(expr)

Copilot uses AI. Check for mistakes.
# FROM clause with JOIN
join_condition = OperatorNode(emp_dept_id, "=", dept_id)
join_node = JoinNode(emp_table, dept_table, JoinType.INNER, join_condition)
from_clause = FromNode({join_node})
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FromNode expects a list as its _sources parameter, but a set is passed here. According to core/ast/node.py line 195, FromNode.init expects _sources: List['Node']. Change {join_node} to [join_node] to match the expected type.

Suggested change
from_clause = FromNode({join_node})
from_clause = FromNode([join_node])

Copilot uses AI. Check for mistakes.
Comment on lines +269 to +271
# List literals (for IN clauses) - convert to tuple for hashability
parsed = tuple(self.parse_expression(item) for item in expr)
return LiteralNode(parsed)
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The LiteralNode constructor expects _value: str|int|float|bool|datetime|None (see core/ast/node.py:96), but a tuple is being passed here when parsing list expressions. Tuples are not in the allowed types for LiteralNode values. This could cause type errors or unexpected behavior. Consider either updating LiteralNode to support tuple types or handling list literals differently.

Suggested change
# List literals (for IN clauses) - convert to tuple for hashability
parsed = tuple(self.parse_expression(item) for item in expr)
return LiteralNode(parsed)
# List literals (for IN clauses)
parsed = [self.parse_expression(item) for item in expr]
return parsed

Copilot uses AI. Check for mistakes.
Comment on lines +163 to +166
def parse_where(self, where_dict: dict) -> WhereNode:
predicates = set()
predicates.add(self.parse_expression(where_dict))
return WhereNode(predicates)
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WhereNode constructor expects a list but receives a set here. The predicates variable is initialized as a set (line 164) and set operations are used (line 165), but WhereNode.init signature expects _predicates: List['Node'] (see core/ast/node.py:201). Change to use a list instead of a set.

Copilot uses AI. Check for mistakes.
Comment on lines +131 to +133
join_node = JoinNode(left_source, right_table, join_type, on_condition)
# The result of this JOIN becomes the new left source for potential next JOIN
left_source = join_node
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JoinNode constructor expects _left_table: 'TableNode' as its first parameter (see core/ast/node.py:163), but left_source could be a JoinNode from a previous iteration (as indicated by the comment on line 132 and assignment on line 133). This will cause a type error when chaining multiple joins. The JoinNode signature should be updated to accept Node types, or the parsing logic should be restructured to handle join chains differently.

Copilot uses AI. Check for mistakes.

# [2] Our new code
# Any (JSON) -> AST (QueryNode)
self.aliases = {}
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The self.aliases instance variable is initialized within the parse() method rather than in __init__. This means the aliases dictionary persists between parse calls and could lead to stale alias references affecting subsequent parses. Consider either: 1) initializing self.aliases = {} in a proper __init__ method, or 2) passing aliases as a parameter through the helper methods instead of using instance state.

Copilot uses AI. Check for mistakes.
# Pattern 3: Function call
# Special case: COUNT(*), SUM(*), etc.
if value == '*':
return FunctionNode(op_name, [ColumnNode('*')])
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] In the test's expected AST, count_star is created with _args as a keyword argument, but in the parser (line 314), FunctionNode is called with positional arguments. While this works since _args is the second positional parameter after _name, using keyword arguments consistently would improve code clarity and match the test's style. Consider changing line 314 to: return FunctionNode(op_name, _args=[ColumnNode('*')])

Suggested change
return FunctionNode(op_name, [ColumnNode('*')])
return FunctionNode(op_name, _args=[ColumnNode('*')])

Copilot uses AI. Check for mistakes.
from core.ast.node import (
Node, QueryNode, SelectNode, FromNode, WhereNode, TableNode, ColumnNode,
LiteralNode, OperatorNode, FunctionNode, GroupByNode, HavingNode,
OrderByNode, OrderByItemNode, LimitNode, OffsetNode, SubqueryNode, VarNode, VarSetNode, JoinNode
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'SubqueryNode' is not used.
Import of 'VarNode' is not used.
Import of 'VarSetNode' is not used.

Suggested change
OrderByNode, OrderByItemNode, LimitNode, OffsetNode, SubqueryNode, VarNode, VarSetNode, JoinNode
OrderByNode, OrderByItemNode, LimitNode, OffsetNode, JoinNode

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants