Skip to content

[Epic] Adaptative Query Execution #377

@gabotechs

Description

@gabotechs

The TaskEstimator API currently performs a very rough estimation of an appropriate number of tasks for the different stages of a query. It's mechanism relies on some details that are typically flawed:

  • How well can a user of this library estimate the number of tasks for a leaf node at planning time.
  • How accurate is to increase/decrease the task count just because a node is "reducing" or not the cardinality of the data.

Because of this, an attempt to ship a cost-based task estimator was attempted in #311, which aimed to rely on the standard DataFusion stats propagation machinery to estimate how many rows are going to be flowing through the nodes. There are several issues with that:

As a consequence, Cost based planning is not capable of demonstrating any kind of improvement.

I don't think there's any technical limitation preventing this project from implementing adaptative query execution, it has been mentioned before #171, and with the amount of uncertainty there is for all attempted stats-based approaches, I think it might be worth trying to ship some form of adaptative query execution to this project.


This effort can be split in several deliverables, which intend to add value on their own even if we decide that adaptative query execution is not a good idea:

Also, special mention to this one, that is tangentially related with future API evolutions:

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions