@EnricoMi EnricoMi commented Jun 12, 2025

What changes were proposed in this pull request?

When a fallback storage is configured, ShuffleBlockFetcherIterator can optimistically try to read a missing block from the fallback storage, because the block may have been migrated there from a decommissioned executor.

Note: This optimistic attempt to find the missing shuffle data on the fallback storage may conflict with the replication delay handled in #16.

Why are the changes needed?

In a Kubernetes environment, executors may be decommissioned. With a fallback storage configured, their shuffle data is migrated to other executors or to the fallback storage. A task that starts while another executor is decommissioning may try to read blocks from that executor after it has been decommissioned, and the task does not know the new location of the migrated blocks. Given that a fallback storage is configured, the task can optimistically try to read the block from the fallback storage.

This avoids a stage retry, which is otherwise an expensive way to obtain the current block address after a block migration.
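The idea can be illustrated with a minimal, self-contained sketch. It is not the code from this patch and does not use Spark's internal API: `BlockId`, `BlockSource`, `InMemorySource`, `FallbackFetch`, and the result types are hypothetical names chosen for illustration. It assumes a failed executor read can simply be followed by one read attempt against the fallback storage, and that only a miss in both places leads to a fetch failure (and hence a stage retry).

```scala
// Hypothetical, simplified model of the optimistic fallback read.
// None of these names are Spark's actual classes or methods.
final case class BlockId(shuffleId: Int, mapId: Long, reduceId: Int)

sealed trait FetchResult
final case class Fetched(data: Array[Byte], source: String) extends FetchResult
case object NotFound extends FetchResult // in Spark this would end in a stage retry

// A block source returns Some(bytes) if it holds the block, None otherwise.
trait BlockSource {
  def read(id: BlockId): Option[Array[Byte]]
}

// Stand-in for an executor or a fallback storage, backed by an in-memory map.
final class InMemorySource(blocks: Map[BlockId, Array[Byte]]) extends BlockSource {
  def read(id: BlockId): Option[Array[Byte]] = blocks.get(id)
}

object FallbackFetch {
  // Try the executor first; if the block is missing there and a fallback
  // storage is configured, optimistically try the fallback storage.
  def fetch(
      id: BlockId,
      executor: BlockSource,
      fallback: Option[BlockSource]): FetchResult =
    executor.read(id) match {
      case Some(bytes) => Fetched(bytes, "executor")
      case None =>
        fallback.flatMap(_.read(id)) match {
          case Some(bytes) => Fetched(bytes, "fallback")
          case None        => NotFound
        }
    }
}
```

In this sketch, a block that was migrated to the fallback storage is served from there even though the task still holds the old executor address. Without a configured fallback storage (`fallback = None`), the behaviour is unchanged and the miss is reported as before.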

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests and manual testing in a Kubernetes setup.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Jun 12, 2025
@EnricoMi EnricoMi changed the title Attempt to read missing block from fallback storage [SPARK-52507][K8S] Attempt to read missing block from fallback storage Jun 17, 2025
@EnricoMi EnricoMi force-pushed the fallback-storage-retry-from-fallback branch from b7c2890 to 8b072b3 Compare September 2, 2025 05:08
@EnricoMi EnricoMi force-pushed the fallback-storage-retry-from-fallback branch from f18110f to 70ab21e Compare November 24, 2025 11:01