Skip to content

Conversation

@MBkkt
Copy link
Collaborator

@MBkkt MBkkt commented Oct 12, 2025

If any questions "why", please ask, I spent some time in trying to understand what current code wants and what is actually written.

In short I did following:

  • Remove DistributionType::numPartitions, because it was unused and confusing
  • Remove DistributionType::mode, because it was unused and confusing
  • Remove isSamePartition because it was incorrect for broadcast at least
  • Add needsShuffle instead of isSamePartition, it will work correct for broadcast and will allow more cases
  • Replace isSameOrder with needsSort, now if something has sort over "a, b" we don't need to resort for sort "a"
  • Fix plan enumeration and shuffle costs they're both was incorrect

Some Notes:

All current tests are passed except few with incorrect join order expected in the tests, such tests adjusted

In most places Cost::cost used as plan cost, plan input cardinality is always 1, so we can omit inputCardinality from computation for them. But in such case function will became invalid for relation. If needed we can have two functions

Currently input/result cardinality doesn't handle something like select count(*) from (select 1 limit 0)
Because cardinality stored as fanout and multiply doesn't handle this case in a good way.
I have idea how to fix this, but it will be done in the next PR (because requires some refactoring in how we process cost)

// select count(*) from (select 1 limit 0)
// i -- input cardinality
// f -- fanout
// r -- result cardinality (i * f) 
// value scan  i=1 f=1 r=1
// project     i=1 f=1 r=1
// limit       i=1 f=0 r=0
// aggregation i=0 f=? r=1 (now f will be NaN, because divide by zero)

Follow ups:

  1. Now shuffle cost for broadcast, gather and partition is same, I'm not sure it's right, especially for broadcast
    • As workaround we can multiply broadcast shuffle cost part by number of workers.
  2. I'm not sure how velox works, do we keep order of rows after broadcast? If yes, we can keep everything in Distribution::broadcast and only replace DistributionType. It can allow us to enumerate some interesting plans in future, but I think it can be bad in terms of choice of best plan before we fix 1.

@MBkkt MBkkt requested a review from mbasmanova October 12, 2025 03:29
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 12, 2025
@MBkkt
Copy link
Collaborator Author

MBkkt commented Oct 12, 2025

CC @pashandor789

@MBkkt MBkkt force-pushed the mbkkt/fix-distribution branch 2 times, most recently from 56364bc to 09bcab9 Compare October 13, 2025 20:10
@MBkkt MBkkt force-pushed the mbkkt/fix-distribution branch 3 times, most recently from 2b6dc1f to aceabf7 Compare October 14, 2025 21:04
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lot of changes. I'm not sure how to process these. Is there a way to break up the PR into smaller pieces?

/// is equal (==) to the other locus. A Locus is referenced by raw pointer and
/// may be allocated from outside the optimization arena. It is immutable and
/// lives past the optimizer arena.
class Locus {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you removing this class? It is not mentioned in the PR description.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@MBkkt
Copy link
Collaborator Author

MBkkt commented Oct 14, 2025

@mbasmanova

This is a lot of changes. I'm not sure how to process these. Is there a way to break up the PR into smaller pieces?

Not really a lof, it's 3 hundreds, not 3 thousands, unfortunately I'm not sure how to make it simpler.
I think there are few things that can be separated, like

  • Locus, numPartitions, mode removal
  • Use current code to add layouts as they supposed to be added (I'm not sure how wrong isSamePartition, etc will react on this)
  • fix for missed unnest inputCardinality
  • some refactoring lines, e.g. pop_back vs erase, etc (some of them requires "fix missed unnest inputCardinality")

But I also think main complexity of this PR, not in these small parts, but in -- replace isSamePartition and isSameOrder with needsShuffle and needsSort + change plan enumeration to be correct and use to these new methods in a correct way

Do you sure separation to such pieces will help you to process them?
If yes, I will do this.

In general I think there's still issues with PlanState (cost in PlanState isn't always corresponds to "current Plan"), but I plan to fix this separately, otherwise PR will be thousands loc...
For an example such asserts make sense logically, but fails with main (and this branch) code and tests.
image

@MBkkt MBkkt requested a review from mbasmanova October 14, 2025 21:20
@MBkkt MBkkt force-pushed the mbkkt/fix-distribution branch from aceabf7 to a9f30dd Compare October 14, 2025 21:35
@mbasmanova
Copy link
Contributor

VELOX_CHECK_EQ(cost.inputCardinality, 1) is incorrect. This is correct only for the leaf RelationOp (scan).

I'd like to first extend TableWrite to add support for writing partitioned tables.

@mbasmanova
Copy link
Contributor

Remove Locus because it's unused now, and if partition will be connector dependent we won't need this

Can this be a separate PR?

Use current connector interface to add all connector table layouts

Would you explain this a bit more? Can this be a separate PR?

@MBkkt
Copy link
Collaborator Author

MBkkt commented Oct 14, 2025

@mbasmanova

VELOX_CHECK_EQ(cost.inputCardinality, 1) is incorrect. This is correct only for the leaf RelationOp (scan).

I think it's correct because cost is partial plan cost, not relation operator cost.
Plan cannot starts with inputCardinality != 1

I think main confusion with current code that Cost is actually 2 types of cost:

  1. relation operator cost => here is everything is oblivious
  2. plan cost => here things are a little different:
    • inputCardinality is always 1
    • fanout is result cardinality (so in most cases it can be multiply of relation operators fanouts)
    • setupCost -- sum of all setup costs
    • unitCost -- sum of all (whole operator cost minus operator setup cost)

Hope this helps

@MBkkt
Copy link
Collaborator Author

MBkkt commented Oct 14, 2025

Can this be a separate PR?

#503

Would you explain this a bit more? Can this be a separate PR?

#504

@MBkkt MBkkt force-pushed the mbkkt/fix-distribution branch 4 times, most recently from 8250415 to bd26d1d Compare October 16, 2025 06:02
@MBkkt MBkkt changed the title fix: Fix distribution, order, layouts and plan enumeration fix: Fix distribution and plan enumeration Oct 16, 2025
@MBkkt MBkkt force-pushed the mbkkt/fix-distribution branch 3 times, most recently from c59fb5a to 848d6dc Compare October 17, 2025 11:07
@mbasmanova
Copy link
Contributor

@MBkkt I'm not able to parse this change. Would it be possible to describe a problem being fixes and how it is fixed? If the original problem is too messy to explain, perhaps, describe the new design you propose and how it works in all different scenarios?

@MBkkt MBkkt force-pushed the mbkkt/fix-distribution branch 3 times, most recently from c7d319b to 0c9072e Compare October 19, 2025 00:15
@MBkkt
Copy link
Collaborator Author

MBkkt commented Oct 19, 2025

@mbasmanova tpch.q9 in this PR starts to fail because of this #527
Can you look and give me advice what is to do?

Comment on lines +314 to +356
if (isWorse(old)) {
// Remove old plan, it is worse than the new one in all aspects.
queryCtx()->optimization()->trace(
OptimizerOptions::kExceededBest, state.dt->id(), old.cost, *old.op);
std::swap(plans[i], plans.back());
plans.pop_back();
--i;
found = kFoundWorse;
} else if (found == kNone && isBetter(old)) {
// Old plan is better than the new one in all aspects.
found = kFoundBetter;
}
}
if (found == kFoundBetter) {
// No existing plan was worse than the new one in all aspects,
// and at least one existing plan is better than the new one in all aspects.
// So don't add the new plan.
return nullptr;
}
Copy link
Collaborator Author

@MBkkt MBkkt Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova I think one of main changes is this part of code. Everything else is related continuation.

What is doing new code?

We want to add new plan
In first we remove all old plans that is worse than new.
And if there's no such plans, and exists at least one old plan that is better than new, we don't need to add new plan.
Also we should account that plans can have different distribution (order and partition).
If order is not compatible, we cannot say one is worse or better.
If partition is not compatible, we need to account cost of reshuffle plan.

Old code

Did something strange?

  1. It didn't remove worse plans, it replaced only one of them
  2. It used incorrect checks to detect isBetter plan, like !old->isStateBetter(state, -shuffleCostPerRow))
  3. It used incorrect code to check distribution is same.
  4. It used incorrect cost computation (now it's formula written single time in single place)

@MBkkt MBkkt linked an issue Oct 23, 2025 that may be closed by this pull request
@MBkkt MBkkt force-pushed the mbkkt/fix-distribution branch 2 times, most recently from 079b039 to f9f8fca Compare October 27, 2025 18:16
@MBkkt MBkkt force-pushed the mbkkt/fix-distribution branch from f9f8fca to bbe428f Compare November 1, 2025 12:54
@MBkkt MBkkt force-pushed the mbkkt/fix-distribution branch from bbe428f to 2a41fae Compare November 8, 2025 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

NextJoin::isWorse logic is incorrect

2 participants