Skip to content

[WIP] [Feat] Ffts D2D transport dev#990

Open
pyxyzc wants to merge 7 commits into
ModelEngine-Group:feature_26h1from
pyxyzc:feature_hbm_kv_buffer
Open

[WIP] [Feat] Ffts D2D transport dev#990
pyxyzc wants to merge 7 commits into
ModelEngine-Group:feature_26h1from
pyxyzc:feature_hbm_kv_buffer

Conversation

@pyxyzc

@pyxyzc pyxyzc commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Purpose

Introduce an Ascend FFTS Plus based D2D transport module for P2P cache data movement, providing an asynchronous copy interface backed by ACL stream execution and FFTS SDMA task submission.

Modifications

  • Add ucm_transport_ffts static library under ucm/transport/p2p/ffts.
  • Add CMake discovery and linkage for Ascend CANN headers/libraries, including ascendcl, runtime, FFTS runtime headers, profiling headers, and toolchain headers.
  • Add public FftsTransport API with Setup, WaitEvent, CopyAsync, batched Submit, and Synchronize.
  • Implement FFTS engine logic to:
    • create and manage an ACL fast-launch/fast-sync stream;
    • convert copy descriptors into FFTS Plus SDMA contexts;
    • split large batches by the FFTS ready-context limit;
    • filter no-op copies and validate invalid descriptors;
    • keep submitted context buffers alive until stream synchronization.

Test

FFTS transport tests covering:

  • Copy before setup: verifies copy requests fail before Setup() is called.
  • Single device copy: verifies one device-to-device copy preserves the source data.
  • Batched copies: verifies multiple chunked copy descriptors are submitted and completed correctly.
  • Large batch split: verifies large submissions are split internally and all chunks are copied.
  • Two-stage pipeline: verifies data can be copied through an intermediate device buffer.
  • Invalid/no-op inputs: verifies zero-size operations succeed as no-ops and invalid non-zero operations fail.

@pyxyzc pyxyzc requested review from Fengli5355 and nrj868 June 1, 2026 12:14
@pyxyzc pyxyzc changed the title [Feat] Ffts D2D transport dev [WIP] [Feat] Ffts D2D transport dev Jun 1, 2026
Comment thread ucm/transport/p2p/ffts/src/ffts_engine.cpp
Comment thread ucm/transport/p2p/ffts/src/ffts_engine.cpp
Comment thread ucm/transport/p2p/ffts/src/ffts_engine.cpp

void FftsEngine::KeepAlive(std::shared_ptr<ContextBuffer> contexts)
{
pendingContexts_.emplace_back(std::move(contexts));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Memory Consideration: pendingContexts_ grows without bound if Synchronize() is not called frequently. Consider adding a size limit or a cleanup mechanism to prevent unbounded memory growth in long-running operations.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces an Ascend FFTS Plus-based device-to-device transport library (ucm_transport_ffts) under ucm/transport/p2p/ffts. The transport exposes an asynchronous copy API that wraps an ACL fast-launch/fast-sync stream and submits SDMA contexts via rtFftsPlusTaskLaunchWithFlag.

Changes:

  • Add public FftsTransport API (Setup, WaitEvent, CopyAsync, batched Submit, Synchronize) and a private FftsEngine that builds SDMA contexts, splits batches by a 128 ready-context limit, and keeps context buffers alive until stream sync.
  • Add CMake discovery for Ascend CANN headers (acl, runtime, profiling, toolchain) and libraries (ascendcl, runtime), gated on RUNTIME_ENVIRONMENT == ascend.
  • Add gtest-based unit tests covering setup ordering, single/batched copies, large-batch splitting, two-stage pipelining, and invalid/no-op input handling.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

Show a summary per file
File Description
ucm/transport/p2p/CMakeLists.txt Conditionally include ffts subdirectory on Ascend builds.
ucm/transport/p2p/ffts/CMakeLists.txt Locate Ascend CANN headers/libs and define the ucm_transport_ffts static library.
ucm/transport/p2p/ffts/include/ffts_transport.h Public API: CopyDesc plus FftsTransport pimpl façade.
ucm/transport/p2p/ffts/detail/ffts_engine.h Internal FftsEngine declaration with stream/state and pending context buffer storage.
ucm/transport/p2p/ffts/src/ffts_transport.cpp Pimpl implementation forwarding to FftsEngine.
ucm/transport/p2p/ffts/src/ffts_engine.cpp Engine logic: stream setup, SDMA context construction, FFTS Plus task launch, batch splitting, lifetime management.
ucm/transport/p2p/ffts/test/CMakeLists.txt Build wiring for the gtest test binary.
ucm/transport/p2p/ffts/test/ffts_transport_test.cpp Functional tests for the FFTS transport.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

FftsEngine(FftsEngine&&) = delete;
FftsEngine& operator=(FftsEngine&&) = delete;

Status Setup(int32_t deviceId);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it necessary to also have a teardown/shutdown for this class?

@pyxyzc pyxyzc Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RAII cleanup is sufficient for the current FftsEngine. The destructor already releases pending context buffers and destroys the ACL stream. Global ACL runtime and device lifecycle are intentionally not managed by FftsEngine.

Comment thread ucm/transport/p2p/ffts/src/ffts_engine.cpp
{
std::vector<size_t> result;
size_t begin = 0;
while (begin <= value.size()) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Boundary Condition: The loop condition while (begin <= value.size()) could potentially cause issues. When begin == value.size(), the subsequent value.substr(begin, ...) would return an empty string, which throws an exception. While this works due to the break statement, the condition should ideally be while (begin < value.size()) for clarity and to avoid the edge case entirely. The current logic relies on the exception to handle the boundary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants