Skip to content

[Feat] Add Cache IO aggregation for Ascend#1009

Open
NaganooMei wants to merge 1 commit into
ModelEngine-Group:developfrom
NaganooMei:io-aggregation-upstream-pr
Open

[Feat] Add Cache IO aggregation for Ascend#1009
NaganooMei wants to merge 1 commit into
ModelEngine-Group:developfrom
NaganooMei:io-aggregation-upstream-pr

Conversation

@NaganooMei

@NaganooMei NaganooMei commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add an Ascend Cache IO aggregation executor for shard-level H2D and D2H transfers.
  • Add the Ascend shard IO aggregator and FFTS D2D dispatcher used by the aggregated executor.
  • Wire CacheStore load and dump queues through the new executor abstraction while keeping the tensor executor as the default path.
  • Add build and setup switches so upstream builds do not compile IO aggregation unless explicitly enabled.
  • load path:
image ## Validation
Scenario Baseline With IO Aggregation Result
GQA model, 32K IO load/dump ~4 GB/s per device × 8 ~22 GB/s per device × 8 ~5.75× bandwidth improvement
DeepSeek-V4 CacheStore IO ~3–4 GB/s per device × 8 ~8–10 GB/s per device × 8 ~2–3× bandwidth improvement
Default upstream build IO aggregation disabled Not compiled unless explicitly enabled Default tensor executor path preserved
Code check git diff --check upstream/develop..HEAD Passed No whitespace/check-format issues

@NaganooMei NaganooMei force-pushed the io-aggregation-upstream-pr branch 4 times, most recently from 6f5fc11 to c1efb06 Compare June 9, 2026 08:04
@NaganooMei NaganooMei marked this pull request as ready for review June 9, 2026 08:51
@NaganooMei NaganooMei changed the title Add Cache IO aggregation for Ascend [Feat] Add Cache IO aggregation for Ascend Jun 9, 2026
@NaganooMei NaganooMei force-pushed the io-aggregation-upstream-pr branch from c1efb06 to f40e380 Compare June 9, 2026 08:58

void AscendShardIOAggregator::Cleanup() noexcept
{
if (deviceId_ >= 0) { (void)aclrtSetDevice(deviceId_); }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Suggestion: After aclrtSetDevice, the code should check for errors before proceeding. The current implementation checks this, which is good.


Status FftsD2DDispatcher::AddMemcpy(void* dst, const void* src, size_t size)
{
if (completed_) { return Status::Error("FFTS dispatcher already launched"); }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Suggestion: The completed_ flag prevents adding new copies after launch. This is a good safety check. Consider also resetting this flag in Reset() to allow reuse of the dispatcher.

@NaganooMei NaganooMei force-pushed the io-aggregation-upstream-pr branch from f40e380 to d48eb09 Compare June 11, 2026 08:17
public:
Status Setup(const Config&) override
{
return Status::InvalidParam("Cache IO aggregation is not compiled");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Suggestion: The UnavailableAggregatedIOExecutor returns Status::InvalidParam for all methods when IO aggregation is not compiled. Since this is not a parameter validation error (the parameters may be perfectly valid), consider using Status::Unsupported() instead. This better communicates that the feature is simply not available in this build configuration.

The Status::Unsupported() method is available in ucm/shared/infra/status/status.h and maps to error code -50008, which the troubleshooting documentation describes as "Feature or operation not supported in current context" - exactly matching this scenario.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants