-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Enforce explicit opt-in for WITHIN GROUP syntax in aggregate UDAFs
#18607
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Ensure SQL WITHIN GROUP(...) is only allowed for UDAFs that opt into ordered-set semantics. Cached the support check locally and updated the condition to return an error when the clause is incorrectly used. Added a unit test to verify this rejection for non-ordered-set aggregate functions.
…UDAFs and enhance upgrade documentation for stricter planner enforcement
…DAFs and remove obsolete test file
…n for non-ordered-set aggregates
Refactor within_group_rejected_for_non_ordered_set_udaf to extract the planner error using .expect_err(...).to_string() to avoid brittle Debug formatting issues. Add within_group_allowed_for_ordered_set_udaf to test percentile_cont_udaf() registration with the planner, ensuring the planning of WITHIN GROUP invocation succeeds and verifying that the resulting plan mentions percentile_cont.
… WITHIN GROUP clause
…_integration tests
Add a concise comment for explicit opt-in in the planner. Extract argument-prepending logic into a helper method to centralize the protocol. Add tests to validate rejection of multiple ORDER BY items inside WITHIN GROUP and the combination of USING WITHIN GROUP with OVER.
…rn types and improve readability
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I ask how much of this PR was LLM generated? I see a lot of hallmarks of LLM generated code.
datafusion/sql/src/expr/function.rs
Outdated
| // Type alias to simplify very complex tuple return types used by helper | ||
| // functions in this module. Using a named alias reduces the type complexity | ||
| // shown in compiler errors and improves readability. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These comments feel overly verbose
Also having this type alias so far away from its only usage feels odd to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the comment.
Moving the alias closer to fn extract_and_prepend_within_group_args generated this compiler error
error[E0658]: inherent associated types are unstable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a stable Rust release, this errors is resolved by moving the type alias outside of the impl block and making it a top-level type alias
datafusion/sql/src/expr/function.rs
Outdated
| // Enforce explicit opt-in for WITHIN GROUP: UDAFs must advertise | ||
| // `supports_within_group_clause()` to accept WITHIN GROUP syntax. | ||
| // Previously the planner forwarded WITHIN GROUP to any | ||
| // order-sensitive aggregate; that was too permissive. | ||
| let supports_within_group = fm.supports_within_group_clause(); | ||
|
|
||
| if !within_group.is_empty() && !supports_within_group { | ||
| return plan_err!( | ||
| "WITHIN GROUP is only supported for ordered-set aggregate functions" | ||
| ); | ||
| } | ||
|
|
||
| // If the UDAF opted into WITHIN GROUP handling, convert the | ||
| // WITHIN GROUP ordering into sort expressions and prepend the | ||
| // ordering expressions to the function argument list (as unnamed | ||
| // args). This helper centralizes the argument-prepending protocol | ||
| // so it's easier to maintain and reason about. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto for these comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amended to be more succinct.
datafusion/sql/src/expr/function.rs
Outdated
| args: Vec<Expr>, | ||
| arg_names: Vec<Option<String>>, | ||
| schema: &DFSchema, | ||
| planner_context: &mut PlannerContext, | ||
| ) -> Result<WithinGroupExtraction> { | ||
| let mut args = args; | ||
| let mut arg_names = arg_names; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| args: Vec<Expr>, | |
| arg_names: Vec<Option<String>>, | |
| schema: &DFSchema, | |
| planner_context: &mut PlannerContext, | |
| ) -> Result<WithinGroupExtraction> { | |
| let mut args = args; | |
| let mut arg_names = arg_names; | |
| mut args: Vec<Expr>, | |
| mut arg_names: Vec<Option<String>>, | |
| schema: &DFSchema, | |
| planner_context: &mut PlannerContext, | |
| ) -> Result<WithinGroupExtraction> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure adding these tests here are necessary; SLTs should be sufficient and I believe some of the tests added here are already in SLTs, e.g.
datafusion/datafusion/sqllogictest/test_files/aggregate.slt
Lines 174 to 179 in d57e215
| # Not supported over sliding windows | |
| query error DataFusion error: Error during planning: OVER and WITHIN GROUP clause cannot be used together. OVER is for window functions, whereas WITHIN GROUP is for ordered set aggregate functions | |
| SELECT approx_percentile_cont(0.5) | |
| WITHIN GROUP (ORDER BY c3) | |
| OVER (ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) | |
| FROM aggregate_test_100 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the tests that are already covered in slt.
| ## WITHIN GROUP / Ordered-set aggregates | ||
|
|
||
| Some aggregate functions support the SQL `WITHIN GROUP (ORDER BY ...)` clause. This clause is used by | ||
| ordered-set aggregate functions (for example, percentile and rank-like aggregations) to specify the ordering | ||
| of inputs that the aggregate relies on. In DataFusion, only aggregate functions that explicitly opt into | ||
| ordered-set semantics via their implementation will accept `WITHIN GROUP`; attempting to use `WITHIN GROUP` | ||
| with a regular aggregate (for example `SUM(x) WITHIN GROUP (ORDER BY x)`) will fail during planning with an | ||
| error. This matches Postgres semantics and ensures ordered-set behavior is opt-in for user-defined aggregates. | ||
|
|
||
| Example (ordered-set aggregate): | ||
|
|
||
| ```sql | ||
| percentile_cont(0.5) WITHIN GROUP (ORDER BY value) | ||
| ``` | ||
|
|
||
| Example (invalid usage — planner will error): | ||
|
|
||
| ```sql | ||
| -- This will fail: SUM is not an ordered-set aggregate | ||
| SELECT SUM(x) WITHIN GROUP (ORDER BY x) FROM t; | ||
| ``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we decide to add this to our documentation we should be explicit about which functions are ordered set aggregates. As it is, the user cannot tell by this alone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ordered-set aggregate functions (for example, percentile and rank-like aggregations) to specify the ordering
of inputs that the aggregate relies on. In DataFusion, only aggregate functions that explicitly opt into
ordered-set semantics via their implementation will acceptWITHIN GROUP; attempting to useWITHIN GROUP
I thought this part gave a good indication of ordered set aggregate functions.
…oup_args function
…ITHIN GROUP / percentile_cont)
…raction alias to module scope

Which issue does this PR close?
Closes #18109.
Rationale for this change
Previously, the SQL planner accepted
WITHIN GROUPclauses for all aggregate UDAFs, even those that did not explicitly support ordered-set semantics. This behavior was too permissive and inconsistent with PostgreSQL. For example, queries such asSUM(x) WITHIN GROUP (ORDER BY x)were allowed, even thoughSUMis not an ordered-set aggregate.This PR enforces stricter validation so that only UDAFs that explicitly return
truefromsupports_within_group_clause()may useWITHIN GROUP. All other aggregates now produce a clear planner error when this syntax is used.What changes are included in this PR?
Added type alias
WithinGroupExtractionto simplify complex tuple return types used by helper functions.Introduced a new helper method
extract_and_prepend_within_group_argsto centralize logic for handlingWITHIN GROUPargument rewriting.Updated the planner to:
supports_within_group_clause()can acceptWITHIN GROUP.WITHIN GROUPordering expressions to function arguments only for supported ordered-set aggregates.WITHIN GROUPis used incorrectly.Added comprehensive unit tests verifying correct behavior and failure cases:
WITHIN GROUPrejected for non-ordered-set aggregates (MIN,SUM, etc.).WITHIN GROUPaccepted for ordered-set aggregates such aspercentile_cont.OVERclauses.Updated SQL logic tests (
aggregate.slt) to reflect new rejection behavior.Updated documentation:
aggregate_functions.mdand developer docs to clarify when and howWITHIN GROUPcan be used.upgrading.mdto inform users of this stricter enforcement and migration guidance.Are these changes tested?
✅ Yes.
sql_integration.rsvalidate acceptance, rejection, and argument behavior ofWITHIN GROUPfor both valid and invalid cases.aggregate.slt) include negative test cases confirming planner rejections.Are there any user-facing changes?
✅ Yes.
Users attempting to use
WITHIN GROUPwith regular aggregates (e.g.SUM,AVG,MIN,MAX) will now see a planner error:Documentation has been updated to clearly describe
WITHIN GROUPsemantics and provide examples of valid and invalid usage.No API-breaking changes were introduced; only stricter planner validation and improved error messaging.