Add Mooncake DataProto transfer backend#469

Open
zxpdemonio wants to merge 2 commits into alibaba:main from zxpdemonio:cruz/mooncake_transfer
@zxpdemonio

Summary

This PR adds Mooncake as an optional ROLL transfer backend for structured DataProto transfer.

The implementation keeps ROLL-side integration aligned with the existing transfer backend boundary:

  • ROLL keeps transfer_backend.put/get/delete and RemoteBatch semantics.
  • Mooncake owns the structured object transfer implementation.
  • Mooncake is integrated through the standard mooncake package APIs.
  • The public transfer_backend.put(...) API remains unchanged.

What Changed

Mooncake Backend

  • Added Mooncake backend support using mooncake.structured_object_store.MooncakeBundleTransfer.
  • Stores ROLL DataProto-style payloads as Mooncake structured objects.
  • Supports:
    • tensor batch fields
    • non_tensor_batch
    • mixed tensor/non-tensor payloads
    • lazy field materialization through ColumnRemoteBatch
    • payload cleanup through Mooncake's DataProto handle cleanup
  • Keeps Mooncake as an optional backend selected by config.
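
To illustrate what lazy field materialization means for a column-wise handle, here is a minimal sketch. It is not ROLL's ColumnRemoteBatch or Mooncake's API: the class name `LazyColumnBatch`, the `fetch` callback, and the toy `remote` dictionary are all hypothetical and stand in for whatever the real backend does to pull a field.

```python
from typing import Any, Callable, Dict, List


class LazyColumnBatch:
    """Hypothetical stand-in for a column-wise remote batch handle.

    It holds only field names and a fetch callback. A field's payload is
    pulled through the callback the first time it is accessed, then cached.
    """

    def __init__(self, fields: List[str], fetch: Callable[[str], Any]) -> None:
        self._fields = list(fields)
        self._fetch = fetch
        self._cache: Dict[str, Any] = {}

    def __getitem__(self, name: str) -> Any:
        if name not in self._fields:
            raise KeyError(name)
        if name not in self._cache:
            # First access: materialize this one field only.
            self._cache[name] = self._fetch(name)
        return self._cache[name]

    def materialized(self) -> List[str]:
        """Names of the fields that have been fetched so far."""
        return sorted(self._cache)


# Toy "remote" store (assumption) mixing tensor-like and non-tensor columns.
remote = {
    "input_ids": [[1, 2], [3, 4]],
    "rewards": [0.5, 1.0],
    "prompts": ["a", "b"],
}
fetch_log: List[str] = []


def fetch(name: str) -> Any:
    fetch_log.append(name)  # record every real fetch
    return remote[name]


batch = LazyColumnBatch(list(remote), fetch)
rewards = batch["rewards"]        # triggers one fetch
rewards_again = batch["rewards"]  # served from the cache
```

In this sketch a consumer that reads only `rewards` never pays for transferring `input_ids` or `prompts`, which is the point of materializing per field rather than per batch.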

ROLL Compatibility

  • Preserved the existing four-argument transfer_backend.put(partition, row_ids, fields, batch_size) API.
  • Did not add Mooncake-specific kwargs to the shared transfer backend interface.
  • Existing TransferQueue behavior remains unchanged.
  • ROLL continues to own RemoteBatch / ColumnRemoteBatch semantics.
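
The compatibility constraint above can be sketched as follows. Only the four `put` arguments come from this PR description; the `get` and `delete` signatures, the `TransferBackend` and `InMemoryBackend` class names, the reading of `batch_size` as the number of rows, and the `protocol` option are assumptions made for illustration. The idea shown is that backend-specific settings go into the backend's constructor, so the shared `put` call never needs backend-specific kwargs.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Sequence, Tuple


class TransferBackend(ABC):
    """Hypothetical shared interface; only put's four arguments are from the PR text."""

    @abstractmethod
    def put(self, partition: str, row_ids: Sequence[int],
            fields: Dict[str, List[Any]], batch_size: int) -> None: ...

    @abstractmethod
    def get(self, partition: str, row_ids: Sequence[int],
            field_names: Sequence[str]) -> Dict[str, List[Any]]: ...

    @abstractmethod
    def delete(self, partition: str, row_ids: Sequence[int]) -> None: ...


class InMemoryBackend(TransferBackend):
    """Toy backend: backend-specific options are taken at construction time."""

    def __init__(self, **options: Any) -> None:
        self.options = options
        self._rows: Dict[Tuple[str, int], Dict[str, Any]] = {}

    def put(self, partition, row_ids, fields, batch_size):
        # Assumption: batch_size is the number of rows in this call.
        if len(row_ids) != batch_size:
            raise ValueError("row_ids must contain batch_size entries")
        for name, column in fields.items():
            if len(column) != batch_size:
                raise ValueError(f"field {name!r} must contain batch_size entries")
        for i, row_id in enumerate(row_ids):
            row = self._rows.setdefault((partition, row_id), {})
            for name, column in fields.items():
                row[name] = column[i]

    def get(self, partition, row_ids, field_names):
        return {
            name: [self._rows[(partition, r)][name] for r in row_ids]
            for name in field_names
        }

    def delete(self, partition, row_ids):
        for r in row_ids:
            self._rows.pop((partition, r), None)


# Options live on the backend instance; the put call stays four-argument.
backend = InMemoryBackend(protocol="rdma")
backend.put("train", [10, 11], {"input_ids": [[1, 2], [3, 4]], "prompt": ["a", "b"]}, 2)
fetched = backend.get("train", [11], ["prompt"])
backend.delete("train", [10, 11])
```

Callers written against the abstract class would not need to change when a different backend is selected by config, which is the property the bullets above describe.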

Node-scoped Client

  • Keeps node-scoped Mooncake client support.
  • This avoids initializing Mooncake resources independently in every worker process when node-level reuse is preferred.
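
As a rough illustration of the difference between node-scoped and process-local clients, the sketch below caches clients under a key derived from the scope. Everything here is hypothetical (`StubClient`, `client_key`, `get_client`, and the `"node"` / `"process"` scope strings), and it runs inside a single Python process: sharing one client across real worker processes needs a per-node service of some kind, which this sketch does not attempt to model.

```python
import threading
from typing import Callable, Dict, Hashable, Tuple


class StubClient:
    """Placeholder for a store client that is expensive to initialize."""

    def __init__(self, key: Tuple[Hashable, ...]) -> None:
        self.key = key


_clients: Dict[Tuple[Hashable, ...], StubClient] = {}
_lock = threading.Lock()


def client_key(scope: str, node_id: str, worker_id: int) -> Tuple[Hashable, ...]:
    """Node scope ignores the worker; process scope keeps one entry per worker."""
    if scope == "node":
        return (node_id,)
    if scope == "process":
        return (node_id, worker_id)
    raise ValueError(f"unknown client scope: {scope!r}")


def get_client(scope: str, node_id: str, worker_id: int,
               factory: Callable[[Tuple[Hashable, ...]], StubClient] = StubClient) -> StubClient:
    key = client_key(scope, node_id, worker_id)
    with _lock:
        # Create the client once per key, then reuse it.
        if key not in _clients:
            _clients[key] = factory(key)
        return _clients[key]


a = get_client("node", "node-0", worker_id=0)
b = get_client("node", "node-0", worker_id=1)     # same node, reused
c = get_client("process", "node-0", worker_id=0)
d = get_client("process", "node-0", worker_id=1)  # separate per worker
```

Under node scope the two workers on `node-0` receive the same object, so setup happens once; under process scope each worker gets its own.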

Documentation

  • Updated RemoteBatch transfer documentation.
  • Documented Mooncake as an optional structured DataProto backend.
  • Added Mooncake backend configuration example.
  • Updated English and Chinese docs.

Tests

  • Rewrote Mooncake transfer tests without fake/mock Mooncake modules.
  • Added real RDMA-backed Mooncake round-trip coverage.
  • The test uses the standard mooncake command and package names and the standard Mooncake environment variables.
  • No hardcoded local repository paths are included in the test file.
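
One way a test could pick up its connection settings from the environment, rather than from hardcoded values, is sketched below. The four variable names are the ones used in the validation command in this description; the helper name `mooncake_env` and the choice to treat all four as required are assumptions, not a description of the actual test file.

```python
import os
from typing import Dict, Mapping, Optional

# Variable names as they appear in the validation command of this PR.
REQUIRED_VARS = (
    "MOONCAKE_MASTER",
    "MOONCAKE_LOCAL_HOSTNAME",
    "MOONCAKE_PROTOCOL",
    "MOONCAKE_DEVICE_NAME",
)


def mooncake_env(environ: Mapping[str, str] = os.environ) -> Optional[Dict[str, str]]:
    """Return the Mooncake settings, or None if any variable is unset or empty."""
    values = {name: environ.get(name, "") for name in REQUIRED_VARS}
    if any(not value for value in values.values()):
        return None
    return values


# Example with an explicit mapping instead of the real process environment.
complete = mooncake_env({
    "MOONCAKE_MASTER": "192.168.22.70:50051",
    "MOONCAKE_LOCAL_HOSTNAME": "192.168.22.70",
    "MOONCAKE_PROTOCOL": "rdma",
    "MOONCAKE_DEVICE_NAME": "erdma_0",
})
incomplete = mooncake_env({"MOONCAKE_PROTOCOL": "rdma"})
```

A helper like this could feed a `pytest.mark.skipif` condition so that machines without RDMA hardware skip the round-trip tests instead of failing them.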

Validation

PATH=/root/Mooncake-PR2050/build/mooncake-store/src:$PATH \
PYTHONPATH=/root/Mooncake-PR2050/mooncake-wheel:$PYTHONPATH \
MOONCAKE_MASTER=192.168.22.70:50051 \
MOONCAKE_LOCAL_HOSTNAME=192.168.22.70 \
MOONCAKE_PROTOCOL=rdma \
MOONCAKE_DEVICE_NAME=erdma_0 \
/root/roll/.venv/bin/python -m pytest -q \
  tests/distributed/scheduler/test_mooncake_transfer_backend.py

Result:

5 passed

Additional checks:

python -m py_compile \
  roll/distributed/scheduler/transfer_backend.py \
  tests/distributed/scheduler/test_mooncake_transfer_backend.py

git diff --check

zxpdemonio and others added 2 commits June 29, 2026 19:14
Wire Mooncake into the existing DataProto transfer backend path with a node-scoped client by default to reuse per-node store setup and registered buffer pools, while keeping process-local clients configurable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use Mooncake structured object transfer as the optional DataProto backend while keeping ROLL's existing transfer_backend.put API and RemoteBatch semantics. Add real RDMA-backed Mooncake tests without fake Mooncake modules.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.
