Skip to content

[Feat] use host-pinned memory with dual CPU/device addresses for transport buffers#1024

Open
yumingyue624 wants to merge 1 commit into
ModelEngine-Group:feature_26h1from
yumingyue624:adapt_connection
Open

[Feat] use host-pinned memory with dual CPU/device addresses for transport buffers#1024
yumingyue624 wants to merge 1 commit into
ModelEngine-Group:feature_26h1from
yumingyue624:adapt_connection

Conversation

@yumingyue624

Copy link
Copy Markdown
Contributor

Purpose

Switch ASU send/flag buffers from plain host memory to host-pinned
memory so that CPU code packs SQEs through the local mapping while
HCOMM/RDMA uses the device-visible mapping of the same allocation.

Modifications

  1. BufferManager: allocate host-pinned memory via aclrtMallocHost +
    aclrtHostRegisterV2, obtain device pointer via
    aclrtHostGetDevicePointer. ScatterGatherEntry gains device_addr.
    RegisterMemory uses device address for host-pinned regions.
  2. AsuTransportImpl: send/flag buffers use HOST_PINNED instead of HOST.
  3. asu_submit_flow: pass device_addr to SendIoBatch.
  4. sqe_request: use flagBuffer.device_addr for response_buffer_addr.
  5. Tests: added host-pinned dual-address and device_addr assertions.

Test

  • buffer_manager_test: HostPinnedRegistersDeviceAddress.
  • asu_submit_flow_test: BuildSubBatchSendBuffersUsesHostPinnedDeviceAddresses.
  • sqe_request_test: packed response address matches device_addr.

… transport buffers

## Purpose
Switch ASU send/flag buffers from plain host memory to host-pinned
memory so that CPU code packs SQEs through the local mapping while
HCOMM/RDMA uses the device-visible mapping of the same allocation.

## Modifications
1. BufferManager: allocate host-pinned memory via aclrtMallocHost +
aclrtHostRegisterV2, obtain device pointer via
aclrtHostGetDevicePointer. ScatterGatherEntry gains device_addr.
RegisterMemory uses device address for host-pinned regions.
2. AsuTransportImpl: send/flag buffers use HOST_PINNED instead of HOST.
3. asu_submit_flow: pass device_addr to SendIoBatch.
4. sqe_request: use flagBuffer.device_addr for response_buffer_addr.
5. Tests: added host-pinned dual-address and device_addr assertions.

## Test
- buffer_manager_test: HostPinnedRegistersDeviceAddress.
- asu_submit_flow_test: BuildSubBatchSendBuffersUsesHostPinnedDeviceAddresses.
- sqe_request_test: packed response address matches device_addr.
@yumingyue624 yumingyue624 changed the base branch from develop to feature_26h1 June 12, 2026 03:57
@yumingyue624 yumingyue624 requested a review from nrj868 as a code owner June 12, 2026 03:57
}

if (subBatchContext.flagBuffer.addr == 0 || subBatchContext.flagBuffer.length == 0) {
if (subBatchContext.sendSge.device_addr == 0 || subBatchContext.flagBuffer.addr == 0 ||

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about sendSge.addr and sendSge.length?

TransProvider::MemType providerMemType{TransProvider::MemType::MEM_HOST};
};

class BufferRegionCreator : public Trans::AscendBuffer {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming problem, the parent class is a Buffer and the child class is a Creator?

}
BufferRegionCreator regionCreator;
BufferRegion region;
auto allocStatus = regionCreator.MakeRegion(memory_type_, total, region);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Violating the inherit spirit, the child class has its own public function and will be called directly.

// The owner keeps the ACL registration alive until after HCOMM has
// unregistered the region in BufferManager's destructor.
auto owner = std::shared_ptr<void>(hostAddr, [](void* addr) {
aclrtHostUnregister(addr);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Warning] Custom deleter lacks error handling

The custom deleter calls aclrtHostUnregister(addr) and aclrtFreeHost(addr) without checking return values. If aclrtHostUnregister fails, calling aclrtFreeHost on still-registered memory may cause undefined behavior or resource leaks.

Consider:

  1. Logging errors in the deleter (even if silent)
  2. Ensuring the cleanup order is safe per ACL documentation
  3. Using aclrtFreeHost only if unregister succeeded

case MemoryType::HOST: {
auto owner = MakeHostBuffer(size);
if (!owner) { return AllocationFailed("host"); }
region = {owner, owner.get(), owner.get(), TransProvider::MemType::MEM_HOST};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] device_addr semantic inconsistency for HOST memory type

For MemoryType::HOST, deviceAddr is set equal to localAddr (owner.get()). This is semantically ambiguous because plain host memory does not have a device-visible mapping. The comment in buffer_manager.h states device_addr is for device-visible address used by HCOMM/RDMA, but setting it equal to addr for HOST type contradicts this.

Consider:

  1. Setting device_addr = 0 for HOST type (forcing callers to handle the case)
  2. Or explicitly documenting that addr == device_addr for HOST memory means the same physical address is used

}

if (subBatchContext.flagBuffer.addr == 0 || subBatchContext.flagBuffer.length == 0) {
if (subBatchContext.sendSge.device_addr == 0 || subBatchContext.flagBuffer.addr == 0 ||

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] Missing validation for sendSge.length

The validation now checks sendSge.device_addr, flagBuffer.addr, flagBuffer.device_addr, and flagBuffer.length, but sendSge.length is not validated. If sendSge.length is 0 or invalid, the SendIoBatch will have an incorrect length parameter, potentially causing buffer overflow or underflow in RDMA operations.

Add: sendSge.length == 0 to the validation condition.

const auto offset = idx * slot_size_;
sge.addr = reinterpret_cast<std::uint64_t>(static_cast<char*>(memory_.get()) + offset);
sge.device_addr =
reinterpret_cast<std::uint64_t>(static_cast<char*>(device_memory_) + offset);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Warning] Potential null pointer arithmetic if device_memory_ is null

The device_addr calculation uses static_cast<char*>(device_memory_) + offset. If device_memory_ is nullptr (e.g., due to a failed Init that did not properly set it), this pointer arithmetic produces a garbage address value.

While Allocate() checks memory_ is valid at line 201, it does not verify device_memory_ is non-null. Consider adding: if (!device_memory_) return Status::Error(...)

// Local address used by CPU code for SQE packing and completion polling.
std::uint64_t addr{0};
// Device-visible address used by HCOMM/HIXL and remote RDMA operations.
std::uint64_t device_addr{0};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Performance] Struct size increase may impact cache efficiency

Adding device_addr increases ScatterGatherEntry from ~24 bytes to ~32 bytes (assuming 64-bit system). In high-throughput scenarios with many pending SQEs, this 33% size increase could:

  1. Increase memory footprint for sub-batch contexts
  2. Reduce cache efficiency when processing batches
  3. Impact memory bandwidth in tight loops

This is acceptable given the feature requirement, but worth noting for performance-sensitive deployments.

request.dtype = GetTransportConfigAttr<std::uint8_t>(attrs, "dtype");
request.dspec = GetTransportConfigAttr<std::uint8_t>(attrs, "dspec");
request.response_buffer_addr = flagBuffer.addr;
request.response_buffer_addr = flagBuffer.device_addr;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] Architectural consistency for user-provided buffer addresses

The response_buffer_addr now correctly uses flagBuffer.device_addr for RDMA operations. However, the entry.buffer_addr at line 212 uses entries[index].buffer.region.addr (CPU address). If user-provided KVBuffer regions are also host-pinned memory, they should similarly use device addresses for RDMA operations.

This creates a semantic inconsistency: internal buffers use device_addr for RDMA, but user buffers use addr. Consider:

  1. Adding a device_addr field to MemoryRegion struct
  2. Or documenting that user-provided buffers must use device-visible addresses if they are host-pinned
  3. Or providing a registration API that captures both addresses

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants