[Epic] Adaptive Query Execution #377
Description
The TaskEstimator API currently performs a very rough estimation of an appropriate number of tasks for the different stages of a query. Its mechanism relies on some assumptions that are typically flawed:
- It requires a user of this library to estimate the number of tasks for a leaf node at planning time, which is hard to do well.
- It increases or decreases the task count based solely on whether a node is expected to "reduce" the cardinality of the data, which is an inaccurate heuristic.
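To make the criticism above concrete, here is a minimal sketch of that kind of planning-time heuristic. Note that this is not the actual TaskEstimator API: the `CardinalityEffect` enum and `estimate_tasks` function are hypothetical names used purely for illustration, assuming a rule like "halve the task count for cardinality-reducing nodes".

```rust
/// Hypothetical classification of a plan node's effect on cardinality.
/// (Illustrative only; not part of this project's actual API.)
enum CardinalityEffect {
    Reducing,   // e.g. an aggregate or a selective filter
    Preserving, // e.g. a projection
}

/// Naive planning-time estimate: halve the task count for "reducing"
/// nodes, keep it otherwise, and never go below one task. The problem
/// is that this decision is made before any data has been observed.
fn estimate_tasks(parent_tasks: usize, effect: CardinalityEffect) -> usize {
    match effect {
        CardinalityEffect::Reducing => (parent_tasks / 2).max(1),
        CardinalityEffect::Preserving => parent_tasks,
    }
}

fn main() {
    // A leaf scan planned with 8 tasks, followed by an aggregation:
    let scan_tasks = 8;
    let agg_tasks = estimate_tasks(scan_tasks, CardinalityEffect::Reducing);
    println!("scan: {scan_tasks} tasks, aggregate: {agg_tasks} tasks");
}
```

Whether the aggregation actually needs 4 tasks or 1 depends entirely on the real cardinality at runtime, which is exactly the information a planning-time heuristic does not have.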
Because of this, a cost-based task estimator was attempted in #311, which aimed to rely on the standard DataFusion stats propagation machinery to estimate how many rows would flow through each node. There are several issues with that approach:
- It still relies on users being able to provide accurate stats at planning time, which might be true if reading parquet files, but might not be true if the data source is an API call over the network or similar.
- It assumes correct stats propagation in upstream DataFusion, which unfortunately is far from being the case. There are several efforts underway to improve this situation in DataFusion (Add statistics integration tests apache/datafusion#20292, Epic: Statistics improvements apache/datafusion#8227, EPIC: Making use of NDVs (number of distinct values) in DataFusion apache/datafusion#20766, etc.)
As a consequence, cost-based planning has not been able to demonstrate any measurable improvement.
I don't think there's any technical limitation preventing this project from implementing adaptive query execution; it has been mentioned before in #171. Given the amount of uncertainty in all the attempted stats-based approaches, I think it's worth trying to ship some form of adaptive query execution in this project.
This effort can be split into several deliverables, each of which is intended to add value on its own even if we ultimately decide that adaptive query execution is not a good idea:
- Use a dedicated gRPC service for workers instead of Arrow Flight #373
- Asynchronous distributed planning #382
- Stream metadata from DistributedExec to leaf nodes #374
- Lazily choose the amount of tasks based on sampled data #376
Also, special mention to this one, which is tangentially related to future API evolution: