Commit 0732e44
[SPARK-53809][SQL] Add canonicalization for DataSourceV2ScanRelation
### What changes were proposed in this pull request?
This PR proposes to add a `doCanonicalize` function to `DataSourceV2ScanRelation`. The implementation is similar to [the one in BatchScanExec](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala#L150), as well as [the one in LogicalRelation](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala#L52).
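A minimal sketch of what the override could look like, modeled on `LogicalRelation.doCanonicalize` (the actual patch may additionally normalize other fields, such as the wrapped `DataSourceV2Relation`):
```scala
// Sketch of the override added inside DataSourceV2ScanRelation (not
// necessarily the exact patch): rewrite the exprIds in `output` to
// position-based placeholders so that two semantically identical scans
// canonicalize to the same plan.
override def doCanonicalize(): LogicalPlan = {
  copy(output = output.map(QueryPlan.normalizeExpressions(_, output)))
}
```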
### Why are the changes needed?
Query optimization rules such as MergeScalarSubqueries check whether two plans are identical by [comparing their canonicalized forms](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/MergeScalarSubqueries.scala#L219). For DSv2, on the physical-plan side canonicalization descends through the child hierarchy to BatchScanExec, which [has a doCanonicalize function](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala#L150); on the logical-plan side it descends to DataSourceV2ScanRelation, which, however, has no doCanonicalize function. As a result, two semantically identical logical plans are not recognized as equal.
Moreover, for reference, [DSv1 LogicalRelation](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala#L52) also has `doCanonicalize()`.
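For illustration, plan equality in the optimizer ultimately reduces to comparing canonicalized plans; the helper below is a hypothetical restatement of that check, not code from Spark:
```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// QueryPlan.sameResult boils down to comparing canonicalized plans. Before
// this change, two DSv2 scans of the same source with differing exprIds
// (i#0 vs i#10) canonicalized to different plans, so this returned false.
def plansMatch(a: LogicalPlan, b: LogicalPlan): Boolean =
  a.canonicalized == b.canonicalized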
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
A new unit test is added to show that `MergeScalarSubqueries` works for `DataSourceV2ScanRelation`.
For the query
```sql
select (select max(i) from df) as max_i, (select min(i) from df) as min_i
```
Before introducing the canonicalization, the plan is:
```
== Parsed Logical Plan ==
'Project [scalar-subquery#2 [] AS max_i#3, scalar-subquery#4 [] AS min_i#5]
: :- 'Project [unresolvedalias('max('i))]
: : +- 'UnresolvedRelation [df], [], false
: +- 'Project [unresolvedalias('min('i))]
: +- 'UnresolvedRelation [df], [], false
+- OneRowRelation
== Analyzed Logical Plan ==
max_i: int, min_i: int
Project [scalar-subquery#2 [] AS max_i#3, scalar-subquery#4 [] AS min_i#5]
: :- Aggregate [max(i#0) AS max(i)#7]
: : +- SubqueryAlias df
: : +- View (`df`, [i#0, j#1])
: : +- RelationV2[i#0, j#1] class org.apache.spark.sql.connector.SimpleDataSourceV2$$anon$5
: +- Aggregate [min(i#10) AS min(i)#9]
: +- SubqueryAlias df
: +- View (`df`, [i#10, j#11])
: +- RelationV2[i#10, j#11] class org.apache.spark.sql.connector.SimpleDataSourceV2$$anon$5
+- OneRowRelation
== Optimized Logical Plan ==
Project [scalar-subquery#2 [] AS max_i#3, scalar-subquery#4 [] AS min_i#5]
: :- Aggregate [max(i#0) AS max(i)#7]
: : +- Project [i#0]
: : +- RelationV2[i#0, j#1] class org.apache.spark.sql.connector.SimpleDataSourceV2$$anon$5
: +- Aggregate [min(i#10) AS min(i)#9]
: +- Project [i#10]
: +- RelationV2[i#10, j#11] class org.apache.spark.sql.connector.SimpleDataSourceV2$$anon$5
+- OneRowRelation
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
ResultQueryStage 0
+- *(1) Project [Subquery subquery#2, [id=#32] AS max_i#3, Subquery subquery#4, [id=#33] AS min_i#5]
: :- Subquery subquery#2, [id=#32]
: : +- AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
ResultQueryStage 1
+- *(2) HashAggregate(keys=[], functions=[max(i#0)], output=[max(i)#7])
+- ShuffleQueryStage 0
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=58]
+- *(1) HashAggregate(keys=[], functions=[partial_max(i#0)], output=[max#14])
+- *(1) Project [i#0]
+- BatchScan class org.apache.spark.sql.connector.SimpleDataSourceV2$$anon$5[i#0, j#1] class org.apache.spark.sql.connector.SimpleDataSourceV2$MyScanBuilder RuntimeFilters: []
+- == Initial Plan ==
HashAggregate(keys=[], functions=[max(i#0)], output=[max(i)#7])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=19]
+- HashAggregate(keys=[], functions=[partial_max(i#0)], output=[max#14])
+- Project [i#0]
+- BatchScan class org.apache.spark.sql.connector.SimpleDataSourceV2$$anon$5[i#0, j#1] class org.apache.spark.sql.connector.SimpleDataSourceV2$MyScanBuilder RuntimeFilters: []
: +- Subquery subquery#4, [id=#33]
: +- AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
ResultQueryStage 1
+- *(2) HashAggregate(keys=[], functions=[min(i#10)], output=[min(i)#9])
+- ShuffleQueryStage 0
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=63]
+- *(1) HashAggregate(keys=[], functions=[partial_min(i#10)], output=[min#15])
+- *(1) Project [i#10]
+- BatchScan class org.apache.spark.sql.connector.SimpleDataSourceV2$$anon$5[i#10, j#11] class org.apache.spark.sql.connector.SimpleDataSourceV2$MyScanBuilder RuntimeFilters: []
+- == Initial Plan ==
HashAggregate(keys=[], functions=[min(i#10)], output=[min(i)#9])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=30]
+- HashAggregate(keys=[], functions=[partial_min(i#10)], output=[min#15])
+- Project [i#10]
+- BatchScan class org.apache.spark.sql.connector.SimpleDataSourceV2$$anon$5[i#10, j#11] class org.apache.spark.sql.connector.SimpleDataSourceV2$MyScanBuilder RuntimeFilters: []
+- *(1) Scan OneRowRelation[]
+- == Initial Plan ==
Project [Subquery subquery#2, [id=#32] AS max_i#3, Subquery subquery#4, [id=#33] AS min_i#5]
: :- Subquery subquery#2, [id=#32]
: : +- AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
ResultQueryStage 1
+- *(2) HashAggregate(keys=[], functions=[max(i#0)], output=[max(i)#7])
+- ShuffleQueryStage 0
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=58]
+- *(1) HashAggregate(keys=[], functions=[partial_max(i#0)], output=[max#14])
+- *(1) Project [i#0]
+- BatchScan class org.apache.spark.sql.connector.SimpleDataSourceV2$$anon$5[i#0, j#1] class org.apache.spark.sql.connector.SimpleDataSourceV2$MyScanBuilder RuntimeFilters: []
+- == Initial Plan ==
HashAggregate(keys=[], functions=[max(i#0)], output=[max(i)#7])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=19]
+- HashAggregate(keys=[], functions=[partial_max(i#0)], output=[max#14])
+- Project [i#0]
+- BatchScan class org.apache.spark.sql.connector.SimpleDataSourceV2$$anon$5[i#0, j#1] class org.apache.spark.sql.connector.SimpleDataSourceV2$MyScanBuilder RuntimeFilters: []
: +- Subquery subquery#4, [id=#33]
: +- AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
ResultQueryStage 1
+- *(2) HashAggregate(keys=[], functions=[min(i#10)], output=[min(i)#9])
+- ShuffleQueryStage 0
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=63]
+- *(1) HashAggregate(keys=[], functions=[partial_min(i#10)], output=[min#15])
+- *(1) Project [i#10]
+- BatchScan class org.apache.spark.sql.connector.SimpleDataSourceV2$$anon$5[i#10, j#11] class org.apache.spark.sql.connector.SimpleDataSourceV2$MyScanBuilder RuntimeFilters: []
+- == Initial Plan ==
HashAggregate(keys=[], functions=[min(i#10)], output=[min(i)#9])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=30]
+- HashAggregate(keys=[], functions=[partial_min(i#10)], output=[min#15])
+- Project [i#10]
+- BatchScan class org.apache.spark.sql.connector.SimpleDataSourceV2$$anon$5[i#10, j#11] class org.apache.spark.sql.connector.SimpleDataSourceV2$MyScanBuilder RuntimeFilters: []
+- Scan OneRowRelation[]
```
After introducing the canonicalization, the plan is as follows; note the **ReusedSubquery** nodes:
```
== Parsed Logical Plan ==
'Project [scalar-subquery#2 [] AS max_i#3, scalar-subquery#4 [] AS min_i#5]
: :- 'Project [unresolvedalias('max('i))]
: : +- 'UnresolvedRelation [df], [], false
: +- 'Project [unresolvedalias('min('i))]
: +- 'UnresolvedRelation [df], [], false
+- OneRowRelation
== Analyzed Logical Plan ==
max_i: int, min_i: int
Project [scalar-subquery#2 [] AS max_i#3, scalar-subquery#4 [] AS min_i#5]
: :- Aggregate [max(i#0) AS max(i)#7]
: : +- SubqueryAlias df
: : +- View (`df`, [i#0, j#1])
: : +- RelationV2[i#0, j#1] class org.apache.spark.sql.connector.SimpleDataSourceV2$$anon$5
: +- Aggregate [min(i#10) AS min(i)#9]
: +- SubqueryAlias df
: +- View (`df`, [i#10, j#11])
: +- RelationV2[i#10, j#11] class org.apache.spark.sql.connector.SimpleDataSourceV2$$anon$5
+- OneRowRelation
== Optimized Logical Plan ==
Project [scalar-subquery#2 [].max(i) AS max_i#3, scalar-subquery#4 [].min(i) AS min_i#5]
: :- Project [named_struct(max(i), max(i)#7, min(i), min(i)#9) AS mergedValue#14]
: : +- Aggregate [max(i#0) AS max(i)#7, min(i#0) AS min(i)#9]
: : +- Project [i#0]
: : +- RelationV2[i#0, j#1] class org.apache.spark.sql.connector.SimpleDataSourceV2$$anon$5
: +- Project [named_struct(max(i), max(i)#7, min(i), min(i)#9) AS mergedValue#14]
: +- Aggregate [max(i#0) AS max(i)#7, min(i#0) AS min(i)#9]
: +- Project [i#0]
: +- RelationV2[i#0, j#1] class org.apache.spark.sql.connector.SimpleDataSourceV2$$anon$5
+- OneRowRelation
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
ResultQueryStage 0
+- *(1) Project [Subquery subquery#2, [id=#40].max(i) AS max_i#3, ReusedSubquery Subquery subquery#2, [id=#40].min(i) AS min_i#5]
: :- Subquery subquery#2, [id=#40]
: : +- AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
ResultQueryStage 1
+- *(2) Project [named_struct(max(i), max(i)#7, min(i), min(i)#9) AS mergedValue#14]
+- *(2) HashAggregate(keys=[], functions=[max(i#0), min(i#0)], output=[max(i)#7, min(i)#9])
+- ShuffleQueryStage 0
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=71]
+- *(1) HashAggregate(keys=[], functions=[partial_max(i#0), partial_min(i#0)], output=[max#16, min#17])
+- *(1) Project [i#0]
+- BatchScan class org.apache.spark.sql.connector.SimpleDataSourceV2$$anon$5[i#0, j#1] class org.apache.spark.sql.connector.SimpleDataSourceV2$MyScanBuilder RuntimeFilters: []
+- == Initial Plan ==
Project [named_struct(max(i), max(i)#7, min(i), min(i)#9) AS mergedValue#14]
+- HashAggregate(keys=[], functions=[max(i#0), min(i#0)], output=[max(i)#7, min(i)#9])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=22]
+- HashAggregate(keys=[], functions=[partial_max(i#0), partial_min(i#0)], output=[max#16, min#17])
+- Project [i#0]
+- BatchScan class org.apache.spark.sql.connector.SimpleDataSourceV2$$anon$5[i#0, j#1] class org.apache.spark.sql.connector.SimpleDataSourceV2$MyScanBuilder RuntimeFilters: []
: +- ReusedSubquery Subquery subquery#2, [id=#40]
+- *(1) Scan OneRowRelation[]
+- == Initial Plan ==
Project [Subquery subquery#2, [id=#40].max(i) AS max_i#3, Subquery subquery#4, [id=#41].min(i) AS min_i#5]
: :- Subquery subquery#2, [id=#40]
: : +- AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
ResultQueryStage 1
+- *(2) Project [named_struct(max(i), max(i)#7, min(i), min(i)#9) AS mergedValue#14]
+- *(2) HashAggregate(keys=[], functions=[max(i#0), min(i#0)], output=[max(i)#7, min(i)#9])
+- ShuffleQueryStage 0
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=71]
+- *(1) HashAggregate(keys=[], functions=[partial_max(i#0), partial_min(i#0)], output=[max#16, min#17])
+- *(1) Project [i#0]
+- BatchScan class org.apache.spark.sql.connector.SimpleDataSourceV2$$anon$5[i#0, j#1] class org.apache.spark.sql.connector.SimpleDataSourceV2$MyScanBuilder RuntimeFilters: []
+- == Initial Plan ==
Project [named_struct(max(i), max(i)#7, min(i), min(i)#9) AS mergedValue#14]
+- HashAggregate(keys=[], functions=[max(i#0), min(i#0)], output=[max(i)#7, min(i)#9])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=22]
+- HashAggregate(keys=[], functions=[partial_max(i#0), partial_min(i#0)], output=[max#16, min#17])
+- Project [i#0]
+- BatchScan class org.apache.spark.sql.connector.SimpleDataSourceV2$$anon$5[i#0, j#1] class org.apache.spark.sql.connector.SimpleDataSourceV2$MyScanBuilder RuntimeFilters: []
: +- Subquery subquery#4, [id=#41]
: +- AdaptiveSparkPlan isFinalPlan=false
: +- Project [named_struct(max(i), max(i)#7, min(i), min(i)#9) AS mergedValue#14]
: +- HashAggregate(keys=[], functions=[max(i#0), min(i#0)], output=[max(i)#7, min(i)#9])
: +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=37]
: +- HashAggregate(keys=[], functions=[partial_max(i#0), partial_min(i#0)], output=[max#16, min#17])
: +- Project [i#0]
: +- BatchScan class org.apache.spark.sql.connector.SimpleDataSourceV2$$anon$5[i#0, j#1] class org.apache.spark.sql.connector.SimpleDataSourceV2$MyScanBuilder RuntimeFilters: []
+- Scan OneRowRelation[]
```
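For reference, the added test could look roughly like the following. This is a hypothetical sketch: the test name, the source-class wiring, and the assertions are illustrative, not copied from the patch.
```scala
import org.apache.spark.sql.execution.ReusedSubqueryExec

test("SPARK-53809: merge scalar subqueries over DataSourceV2ScanRelation") {
  spark.read.format(classOf[SimpleDataSourceV2].getName).load()
    .createOrReplaceTempView("df")
  val result = sql(
    "select (select max(i) from df) as max_i, (select min(i) from df) as min_i")
  result.collect() // run the query so AQE finalizes the physical plan
  // With canonicalization in place, the two subqueries merge into one and
  // the second reference shows up as a ReusedSubqueryExec node.
  val reused = result.queryExecution.executedPlan.collectWithSubqueries {
    case r: ReusedSubqueryExec => r
  }
  assert(reused.nonEmpty)
}
```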
### Was this patch authored or co-authored using generative AI tooling?
No
Closes apache#52529 from yhuang-db/scan-canonicalization.
Authored-by: yhuang-db <[email protected]>
Signed-off-by: Peter Toth <[email protected]>
File tree: 2 files changed, +98 −0
- sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2 (+10)
- sql/core/src/test/scala/org/apache/spark/sql/connector (+88)