Skip to content

feat: remove full path from partition locations#1527

Merged
milenkovicm merged 16 commits intoapache:mainfrom
milenkovicm:feat_remove_partition
Apr 4, 2026
Merged

feat: remove full path from partition locations#1527
milenkovicm merged 16 commits intoapache:mainfrom
milenkovicm:feat_remove_partition

Conversation

@milenkovicm
Copy link
Copy Markdown
Contributor

@milenkovicm milenkovicm commented Mar 29, 2026

Which issue does this PR close?

Closes #.

Rationale for this change

  • at the moment full path is used to access shuffle files, having full path in each partition location brings quite overhead.
  • absolute path is a bio of security issue, probably low but still

What changes are included in this PR?

  • remove full path from partition locations, generate it using available information
  • shuffle reader will have executor path injected in the same way like shuffle writer

there are going to be follow up changes after this one, adding subdiroctory to store shuffle files and session id between work dir and job

Are there any user-facing changes?

  • no

@milenkovicm milenkovicm marked this pull request as draft March 29, 2026 09:38
@milenkovicm milenkovicm force-pushed the feat_remove_partition branch 6 times, most recently from f4ae4e9 to 15839b9 Compare March 29, 2026 10:48
@milenkovicm milenkovicm requested a review from martin-g March 29, 2026 10:50
@milenkovicm milenkovicm force-pushed the feat_remove_partition branch from 15839b9 to 3903863 Compare March 29, 2026 11:02
@milenkovicm milenkovicm marked this pull request as ready for review March 29, 2026 11:18
@milenkovicm
Copy link
Copy Markdown
Contributor Author

@sqlbenchmark tpch --iterations 3 --scale-factor 1

@sqlbenchmark
Copy link
Copy Markdown

Ballista TPC-H Benchmark Results

PR: #1527 - remove partition from partition location
PR Commit: 3903863a
Base Commit: 97a74521 (main)
Scale Factor: SF1
Iterations: 3

Query Comparison

Query main (ms) PR (ms) Change
Q1 209.80 206.30 ⚪ -1.7%
Q2 99.50 100.50 ⚪ +1.0%
Q3 135.10 136.00 ⚪ +0.7%
Q4 75.80 74.50 ⚪ -1.7%
Q5 285.90 273.00 ⚪ -4.5%
Q6 60.60 61.40 ⚪ +1.3%
Q7 367.20 368.70 ⚪ +0.4%
Q8 326.10 316.50 ⚪ -2.9%
Q9 454.50 448.60 ⚪ -1.3%
Q10 165.70 155.50 🟢 -6.2%
Q11 71.10 69.50 ⚪ -2.3%
Q12 109.50 106.70 ⚪ -2.6%
Q13 128.90 125.50 ⚪ -2.6%
Q14 62.80 62.30 ⚪ -0.8%
Q15 62.90 61.70 ⚪ -1.9%
Q16 69.40 53.70 🟢 -22.6%
Q17 214.50 209.40 ⚪ -2.4%
Q18 346.80 311.70 🟢 -10.1%
Q19 150.00 142.90 ⚪ -4.7%
Q20 109.90 97.50 🟢 -11.3%
Q21 486.00 443.70 🟢 -8.7%
Q22 49.30 60.30 🔴 +22.3%

Total: Main=4041.30ms, PR=3885.90ms (-3.8%)


Automated benchmark run by dfbench

@milenkovicm
Copy link
Copy Markdown
Contributor Author

@sqlbenchmark tpch --iterations 3 --scale-factor 10

@sqlbenchmark
Copy link
Copy Markdown

Ballista TPC-H Benchmark Results

PR: #1527 - remove partition from partition location
PR Commit: 3903863a
Base Commit: 97a74521 (main)
Scale Factor: SF10
Iterations: 3

Query Comparison

Query main (ms) PR (ms) Change
Q1 1906.40 1923.10 ⚪ +0.9%
Q2 755.90 751.90 ⚪ -0.5%
Q3 1206.30 1248.70 ⚪ +3.5%
Q4 573.90 583.70 ⚪ +1.7%
Q5 2581.60 2719.20 🔴 +5.3%
Q6 524.90 518.50 ⚪ -1.2%
Q7 3698.60 3766.60 ⚪ +1.8%
Q8 3332.60 3178.00 ⚪ -4.6%
Q9 4651.50 4701.70 ⚪ +1.1%
Q10 1488.10 1459.10 ⚪ -1.9%
Q11 541.50 536.40 ⚪ -0.9%
Q12 985.00 967.00 ⚪ -1.8%
Q13 1367.10 1319.30 ⚪ -3.5%
Q14 556.90 554.40 ⚪ -0.4%
Q15 549.20 543.40 ⚪ -1.1%
Q16 342.90 347.10 ⚪ +1.2%
Q17 3398.50 3392.70 ⚪ -0.2%
Q18 4172.30 3836.30 🟢 -8.1%
Q19 1324.40 1261.30 ⚪ -4.8%
Q20 946.80 955.80 ⚪ +1.0%
Q21 4771.90 4370.20 🟢 -8.4%
Q22 307.70 291.90 🟢 -5.1%

Total: Main=39984.00ms, PR=39226.30ms (-1.9%)


Automated benchmark run by dfbench

@milenkovicm milenkovicm force-pushed the feat_remove_partition branch from 3903863 to 0edddcb Compare March 29, 2026 12:09
@milenkovicm
Copy link
Copy Markdown
Contributor Author

@sqlbenchmark tpch --iterations 5 --scale-factor 1

@sqlbenchmark
Copy link
Copy Markdown

Ballista TPC-H Benchmark Results

PR: #1527 - remove partition from partition location
PR Commit: 0edddcba
Base Commit: 97a74521 (main)
Scale Factor: SF1
Iterations: 5

Query Comparison

Query main (ms) PR (ms) Change
Q1 212.30 215.80 ⚪ +1.6%
Q2 100.00 123.20 🔴 +23.2%
Q3 137.90 140.70 ⚪ +2.0%
Q4 76.80 77.20 ⚪ +0.5%
Q5 272.80 289.20 🔴 +6.0%
Q6 60.10 60.40 ⚪ +0.5%
Q7 371.50 368.00 ⚪ -0.9%
Q8 316.10 345.70 🔴 +9.4%
Q9 460.40 481.50 ⚪ +4.6%
Q10 160.80 167.80 ⚪ +4.4%
Q11 70.90 70.80 ⚪ -0.1%
Q12 109.20 109.50 ⚪ +0.3%
Q13 130.80 135.20 ⚪ +3.4%
Q14 64.10 68.20 🔴 +6.4%
Q15 61.10 63.50 ⚪ +3.9%
Q16 64.40 65.80 ⚪ +2.2%
Q17 232.10 224.70 ⚪ -3.2%
Q18 413.10 394.10 ⚪ -4.6%
Q19 151.00 145.80 ⚪ -3.4%
Q20 99.40 102.70 ⚪ +3.3%
Q21 428.10 420.40 ⚪ -1.8%
Q22 52.50 48.00 🟢 -8.6%

Total: Main=4045.40ms, PR=4118.20ms (+1.8%)


Automated benchmark run by dfbench

@milenkovicm
Copy link
Copy Markdown
Contributor Author

@sqlbenchmark tpch --iterations 5 --scale-factor 10

@sqlbenchmark
Copy link
Copy Markdown

Ballista TPC-H Benchmark Results

PR: #1527 - remove partition from partition location
PR Commit: 0edddcba
Base Commit: 97a74521 (main)
Scale Factor: SF10
Iterations: 5

Query Comparison

Query main (ms) PR (ms) Change
Q1 1974.50 1934.60 ⚪ -2.0%
Q2 800.50 765.50 ⚪ -4.4%
Q3 1257.50 1236.50 ⚪ -1.7%
Q4 601.70 580.20 ⚪ -3.6%
Q5 2835.50 2735.00 ⚪ -3.5%
Q6 528.80 527.30 ⚪ -0.3%
Q7 4147.70 3841.10 🟢 -7.4%
Q8 3374.60 3260.10 ⚪ -3.4%
Q9 5025.00 4866.70 ⚪ -3.2%
Q10 1490.40 1510.10 ⚪ +1.3%
Q11 548.30 545.90 ⚪ -0.4%
Q12 1026.90 983.30 ⚪ -4.2%
Q13 1316.30 1261.00 ⚪ -4.2%
Q14 565.70 562.80 ⚪ -0.5%
Q15 576.50 566.50 ⚪ -1.7%
Q16 355.80 349.60 ⚪ -1.7%
Q17 3474.90 3454.80 ⚪ -0.6%
Q18 4134.20 4487.70 🔴 +8.6%
Q19 1280.10 1275.30 ⚪ -0.4%
Q20 977.40 997.10 ⚪ +2.0%
Q21 4600.10 4383.30 ⚪ -4.7%
Q22 319.70 312.50 ⚪ -2.3%

Total: Main=41212.10ms, PR=40436.90ms (-1.9%)


Automated benchmark run by dfbench

@milenkovicm
Copy link
Copy Markdown
Contributor Author

@sqlbenchmark help

@sqlbenchmark
Copy link
Copy Markdown

@sqlbenchmark usage

This bot is whitelisted for DataFusion committers.

Commands

@sqlbenchmark tpch [--iterations N] [--scale-factor N] [--baseline BRANCH] [--baseline-pr N] [-c key=value ...]
@sqlbenchmark criterion [--bench NAME] [--filter PATTERN] [--baseline BRANCH] [--baseline-pr N]
@sqlbenchmark help

Examples

@sqlbenchmark tpch
@sqlbenchmark tpch --iterations 5 --scale-factor 10
@sqlbenchmark tpch -c datafusion.execution.batch_size=4096
@sqlbenchmark tpch --baseline my-branch
@sqlbenchmark criterion
@sqlbenchmark criterion --bench sql_planner
@sqlbenchmark criterion --baseline my-branch
@sqlbenchmark criterion --bench sort_shuffle --baseline-pr 1430

Automated by dfbench

@milenkovicm
Copy link
Copy Markdown
Contributor Author

@sqlbenchmark tpch --iterations 3 --scale-factor 1 --baseline branch-51

@sqlbenchmark
Copy link
Copy Markdown

Ballista TPC-H Benchmark Results

PR: #1527 - remove partition from partition location
PR Commit: 0edddcba
Base Commit: 08544f49 (branch-51)
Scale Factor: SF1
Iterations: 3

Results

Could not parse structured results. Raw output:

Main branch:

    Finished `release` profile [optimized] target(s) in 0.39s
     Running `target/release/tpch benchmark ballista --host localhost --port 50050 --path /app/data --format parquet --iterations 3`
error: The following required arguments were not provided:
    --query <query>

USAGE:
    tpch benchmark ballista --batch-size <batch-size> --format <file-format> --host <host> --iterations <iterations> --partitions <partitions> --path <path> --port <port> --query <query>

For more information try --help

PR branch:

k 110.2 ms and returned 2 rows
Query 12 avg time: 108.9 ms
Query 13 iteration 0 took 141.9 ms and returned 42 rows
Query 13 iteration 1 took 126.0 ms and returned 42 rows
Query 13 iteration 2 took 132.8 ms and returned 42 rows
Query 13 avg time: 133.6 ms
Query 14 iteration 0 took 64.4 ms and returned 1 rows
Query 14 iteration 1 took 61.1 ms and returned 1 rows
Query 14 iteration 2 took 61.5 ms and returned 1 rows
Query 14 avg time: 62.3 ms
Query 15 iteration 0 took 62.3 ms and returned 0 rows
Query 15 iteration 1 took 62.4 ms and returned 0 rows
Query 15 iteration 2 took 61.1 ms and returned 0 rows
Query 15 avg time: 61.9 ms
Query 16 iteration 0 took 56.1 ms and returned 18314 rows
Query 16 iteration 1 took 54.9 ms and returned 18314 rows
Query 16 iteration 2 took 55.0 ms and returned 18314 rows
Query 16 avg time: 55.3 ms
Query 17 iteration 0 took 207.1 ms and returned 1 rows
Query 17 iteration 1 took 210.7 ms and returned 1 rows
Query 17 iteration 2 took 208.9 ms and returned 1 rows
Query 17 avg time: 208.9 ms
Query 18 iteration 0 took 298.2 ms and returned 57 rows
Query 18 iteration 1 took 346.3 ms and returned 57 rows
Query 18 iteration 2 took 338.5 ms and returned 57 rows
Query 18 avg time: 327.7 ms
Query 19 iteration 0 took 137.9 ms and returned 1 rows
Query 19 iteration 1 took 139.8 ms and returned 1 rows
Query 19 iteration 2 took 141.3 ms and returned 1 rows
Query 19 avg time: 139.7 ms
Query 20 iteration 0 took 98.5 ms and returned 186 rows
Query 20 iteration 1 took 93.2 ms and returned 186 rows
Query 20 iteration 2 took 101.8 ms and returned 186 rows
Query 20 avg time: 97.8 ms
Query 21 iteration 0 took 466.3 ms and returned 100 rows
Query 21 iteration 1 took 463.0 ms and returned 100 rows
Query 21 iteration 2 took 408.2 ms and returned 100 rows
Query 21 avg time: 445.8 ms
Query 22 iteration 0 took 48.4 ms and returned 7 rows
Query 22 iteration 1 took 48.9 ms and returned 7 rows
Query 22 iteration 2 took 47.9 ms and returned 7 rows
Query 22 avg time: 48.4 ms


Automated benchmark run by dfbench

@milenkovicm
Copy link
Copy Markdown
Contributor Author

@sqlbenchmark tpch --iterations 3 --scale-factor 1 --baseline branch-52

@sqlbenchmark
Copy link
Copy Markdown

Ballista TPC-H Benchmark Results

PR: #1527 - remove partition from partition location
PR Commit: 0edddcba
Base Commit: 3950f0b3 (branch-52)
Scale Factor: SF1
Iterations: 3

Query Comparison

Query branch-52 (ms) PR (ms) Change
Q1 228.20 202.30 🟢 -11.3%
Q2 98.00 98.60 ⚪ +0.6%
Q3 134.10 134.30 ⚪ +0.1%
Q4 73.50 73.60 ⚪ +0.1%
Q5 270.10 270.40 ⚪ +0.1%
Q6 59.50 70.90 🔴 +19.2%
Q7 384.80 359.40 🟢 -6.6%
Q8 316.80 311.60 ⚪ -1.6%
Q9 444.60 437.80 ⚪ -1.5%
Q10 156.00 164.80 🔴 +5.6%
Q11 80.60 70.00 🟢 -13.2%
Q12 110.30 107.40 ⚪ -2.6%
Q13 129.10 127.80 ⚪ -1.0%
Q14 62.10 62.90 ⚪ +1.3%
Q15 57.20 61.00 🔴 +6.6%
Q16 52.20 65.90 🔴 +26.2%
Q17 208.70 206.40 ⚪ -1.1%
Q18 413.60 307.30 🟢 -25.7%
Q19 144.40 137.40 ⚪ -4.8%
Q20 94.90 95.70 ⚪ +0.8%
Q21 416.20 436.60 ⚪ +4.9%
Q22 44.50 57.50 🔴 +29.2%

Total: Main=3979.40ms, PR=3859.60ms (-3.0%)


Automated benchmark run by dfbench

@milenkovicm
Copy link
Copy Markdown
Contributor Author

hey @martin-g any chance you can have a look whenever you get time

@martin-g
Copy link
Copy Markdown
Member

martin-g commented Apr 2, 2026

I'll try to do it tomorrow!

// transfers the entire file, which contains all partitions.
// Use flight transport (do_get) for sort-based shuffle.
if is_sort_shuffle_output(data_path) {
if is_sort_shuffle_output(&path) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this use is_sort_shuffle now ?

partition,
metrics: ExecutionPlanMetricsSet::new(),
properties,
work_dir: "".to_string(), // to be updated at the executor side
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it would be safer to make it an Option<String> with None as default.
This way if it is not updated later to Some you can fail early.
Currently there is a chance silently to use the current working directory if it is not updated for any reason.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It copy of shuffle writer approach, but i agree it makes sense

Comment thread ballista/executor/src/flight_service.rs Outdated
Comment thread ballista/core/src/serde/scheduler/mod.rs Outdated
Comment thread ballista/executor/src/flight_service.rs Outdated
Comment thread ballista/core/src/execution_plans/shuffle_reader.rs Outdated
Comment thread ballista/core/src/execution_plans/shuffle_reader.rs Outdated
Comment thread ballista/core/src/execution_plans/shuffle_reader.rs Outdated
Comment thread ballista/executor/src/lib.rs Outdated
// replace this check with open, and check for error
//
// Check if this is a sort-based shuffle output (has index file)
if sort_shuffle_enabled && is_sort_shuffle_output(data_path) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use location.is_sort_shuffle ?

Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
milenkovicm and others added 13 commits April 3, 2026 09:10
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
@milenkovicm milenkovicm requested a review from martin-g April 3, 2026 11:45
@milenkovicm milenkovicm merged commit 8c6c864 into apache:main Apr 4, 2026
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants