[Epic] Adaptive Query Execution #377
Description
The TaskEstimator API currently performs a very rough estimation of an appropriate number of tasks for the different stages of a query. Its mechanism relies on some assumptions that are typically flawed:
- It requires a user of this library to estimate the number of tasks for a leaf node at planning time, which is hard to do well.
- It increases or decreases the task count based solely on whether a node is expected to "reduce" the cardinality of the data, which is an inaccurate heuristic.
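To make the criticism above concrete, here is a minimal sketch of that kind of planning-time heuristic. Note that this is not the actual TaskEstimator API: the `CardinalityEffect` enum and `estimate_tasks` function are hypothetical names used purely for illustration, assuming a rule like "halve the task count for cardinality-reducing nodes".

```rust
/// Hypothetical classification of a plan node's effect on cardinality.
/// (Illustrative only; not part of this project's actual API.)
enum CardinalityEffect {
    Reducing,   // e.g. an aggregate or a selective filter
    Preserving, // e.g. a projection
}

/// Naive planning-time estimate: halve the task count for "reducing"
/// nodes, keep it otherwise, and never go below one task. The problem
/// is that this decision is made before any data has been observed.
fn estimate_tasks(parent_tasks: usize, effect: CardinalityEffect) -> usize {
    match effect {
        CardinalityEffect::Reducing => (parent_tasks / 2).max(1),
        CardinalityEffect::Preserving => parent_tasks,
    }
}

fn main() {
    // A leaf scan planned with 8 tasks, followed by an aggregation:
    let scan_tasks = 8;
    let agg_tasks = estimate_tasks(scan_tasks, CardinalityEffect::Reducing);
    println!("scan: {scan_tasks} tasks, aggregate: {agg_tasks} tasks");
}
```

Whether the aggregation actually needs 4 tasks or 1 depends entirely on the real cardinality at runtime, which is exactly the information a planning-time heuristic does not have.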
Because of this, a cost-based task estimator was attempted in #311, which aimed to rely on the standard DataFusion stats propagation machinery to estimate how many rows would flow through each node. There are several issues with that approach:
- It still relies on users being able to provide accurate stats at planning time, which might be true if reading parquet files, but might not be true if the data source is an API call over the network or similar.
- It assumes correct stats propagation in upstream DataFusion, which unfortunately is far from being the case. There are several efforts underway to improve this situation in DataFusion (Add statistics integration tests apache/datafusion#20292, Epic: Statistics improvements apache/datafusion#8227, EPIC: Making use of NDVs (number of distinct values) in DataFusion apache/datafusion#20766, etc.)
As a consequence, cost-based planning has not been able to demonstrate any measurable improvement.
I don't think there's any technical limitation preventing this project from implementing adaptive query execution; it has been mentioned before in #171. Given the amount of uncertainty in all the attempted stats-based approaches, I think it's worth trying to ship some form of adaptive query execution in this project.
This effort can be split into several deliverables, each of which is intended to add value on its own even if we ultimately decide that adaptive query execution is not a good idea:
- Use a dedicated gRPC service for workers instead of Arrow Flight #373
- Asynchronous distributed planning #382
- Stream metadata from DistributedExec to leaf nodes #374
- Lazily choose the amount of tasks based on sampled data #376
Also, special mention to this one, which is tangentially related to future API evolution: