Home

Problem To Solve

Apache Spark community is working on storing shuffle data on remote storage (SPARK-25299, Discussion Document). This Remote Shuffle Service demonstrates one implementation by streaming shuffle data to shuffle servers which run on separate machines.

Design Options

There are multiple options to implement a shuffle service storing data on remote storage. Following are some examples (not comprehensive list of all possible solutions):

Streaming Based Remote Shuffle Service: shuffle clients send shuffle records to remote shuffle service in a streaming style, and shuffle service writes the records to different partition files based on the records’ partition id.
Async Shuffle File Upload: shuffle clients write shuffle records to local disk first, then upload/merge shuffle files to remote storage asynchronously.

This project focuses on the first option to illustrate how to design and implement a streaming based remote shuffle service. See this High Level Design Doc for a quick overview.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Home

Problem To Solve

Design Options

Clone this wiki locally