Bo Yang edited this page Oct 14, 2020 · 2 revisions
Problem To Solve
The Apache Spark community is working on storing shuffle data on remote storage (SPARK-25299, Discussion Document). This Remote Shuffle Service demonstrates one implementation: it streams shuffle data to shuffle servers that run on separate machines.
Design Options
There are multiple ways to implement a shuffle service that stores data on remote storage. The following are some examples (not a comprehensive list of all possible solutions):
Streaming-Based Remote Shuffle Service: shuffle clients send shuffle records to the remote shuffle service in a streaming style, and the shuffle service writes each record to a partition file chosen by the record's partition ID.
Async Shuffle File Upload: shuffle clients write shuffle records to local disk first, then upload/merge shuffle files to remote storage asynchronously.
This project focuses on the first option to illustrate how to design and implement a streaming-based remote shuffle service. See this High Level Design Doc for a quick overview.
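To make the streaming option more concrete, here is a minimal sketch of the idea in Python. It is not this project's implementation: `ToyShuffleServer`, `stream_records`, the `partition-<id>.data` file naming, the 4-byte length prefix, and the CRC32-based partitioner are all hypothetical choices made for illustration, and the "server" is an in-process object writing to a local directory rather than a process on a separate machine.

```python
import os
import tempfile
import zlib
from typing import BinaryIO, Dict, Iterable, List, Tuple


class ToyShuffleServer:
    """Hypothetical in-process stand-in for a remote shuffle server."""

    def __init__(self, root_dir: str) -> None:
        self.root_dir = root_dir
        self._files: Dict[int, BinaryIO] = {}  # one open file per partition

    def _partition_path(self, partition_id: int) -> str:
        # Assumed naming scheme: one file per partition ID.
        return os.path.join(self.root_dir, f"partition-{partition_id}.data")

    def write_record(self, partition_id: int, record: bytes) -> None:
        """Append one record to the file for its partition."""
        f = self._files.get(partition_id)
        if f is None:
            f = open(self._partition_path(partition_id), "ab")
            self._files[partition_id] = f
        # Assumed framing: 4-byte big-endian length, then the payload.
        f.write(len(record).to_bytes(4, "big"))
        f.write(record)

    def close(self) -> None:
        """Flush and close all partition files."""
        for f in self._files.values():
            f.close()
        self._files.clear()

    def read_partition(self, partition_id: int) -> List[bytes]:
        """Read back every record written to one partition."""
        records: List[bytes] = []
        path = self._partition_path(partition_id)
        if not os.path.exists(path):
            return records
        with open(path, "rb") as f:
            while True:
                header = f.read(4)
                if len(header) < 4:
                    break
                records.append(f.read(int.from_bytes(header, "big")))
        return records


def stream_records(
    server: ToyShuffleServer,
    records: Iterable[Tuple[bytes, bytes]],
    num_partitions: int,
) -> None:
    """Client side: send records one by one instead of writing local shuffle files."""
    for key, value in records:
        # Assumed partitioner: CRC32 of the key modulo the partition count.
        partition_id = zlib.crc32(key) % num_partitions
        server.write_record(partition_id, key + b"=" + value)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        server = ToyShuffleServer(root)
        stream_records(server, [(b"a", b"1"), (b"b", b"2"), (b"a", b"3")], 3)
        server.close()
        for pid in range(3):
            print(pid, server.read_partition(pid))
```

In this sketch the client never writes shuffle data locally; each record goes straight to the server, which groups records by partition as they arrive. A real service would also have to handle networking, retries, and concurrent writers, none of which are modeled here.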