[Feat] use host-pinned memory with dual CPU/device addresses for transport buffers#1024
[Feat] use host-pinned memory with dual CPU/device addresses for transport buffers#1024yumingyue624 wants to merge 1 commit into
Conversation
… transport buffers ## Purpose Switch ASU send/flag buffers from plain host memory to host-pinned memory so that CPU code packs SQEs through the local mapping while HCOMM/RDMA uses the device-visible mapping of the same allocation. ## Modifications 1. BufferManager: allocate host-pinned memory via aclrtMallocHost + aclrtHostRegisterV2, obtain device pointer via aclrtHostGetDevicePointer. ScatterGatherEntry gains device_addr. RegisterMemory uses device address for host-pinned regions. 2. AsuTransportImpl: send/flag buffers use HOST_PINNED instead of HOST. 3. asu_submit_flow: pass device_addr to SendIoBatch. 4. sqe_request: use flagBuffer.device_addr for response_buffer_addr. 5. Tests: added host-pinned dual-address and device_addr assertions. ## Test - buffer_manager_test: HostPinnedRegistersDeviceAddress. - asu_submit_flow_test: BuildSubBatchSendBuffersUsesHostPinnedDeviceAddresses. - sqe_request_test: packed response address matches device_addr.
| } | ||
|
|
||
| if (subBatchContext.flagBuffer.addr == 0 || subBatchContext.flagBuffer.length == 0) { | ||
| if (subBatchContext.sendSge.device_addr == 0 || subBatchContext.flagBuffer.addr == 0 || |
There was a problem hiding this comment.
What about sendSge.addr and sendSge.length?
| TransProvider::MemType providerMemType{TransProvider::MemType::MEM_HOST}; | ||
| }; | ||
|
|
||
| class BufferRegionCreator : public Trans::AscendBuffer { |
There was a problem hiding this comment.
Naming problem, the parent class is a Buffer and the child class is a Creator?
| } | ||
| BufferRegionCreator regionCreator; | ||
| BufferRegion region; | ||
| auto allocStatus = regionCreator.MakeRegion(memory_type_, total, region); |
There was a problem hiding this comment.
Violating the inherit spirit, the child class has its own public function and will be called directly.
| // The owner keeps the ACL registration alive until after HCOMM has | ||
| // unregistered the region in BufferManager's destructor. | ||
| auto owner = std::shared_ptr<void>(hostAddr, [](void* addr) { | ||
| aclrtHostUnregister(addr); |
There was a problem hiding this comment.
[Warning] Custom deleter lacks error handling
The custom deleter calls aclrtHostUnregister(addr) and aclrtFreeHost(addr) without checking return values. If aclrtHostUnregister fails, calling aclrtFreeHost on still-registered memory may cause undefined behavior or resource leaks.
Consider:
- Logging errors in the deleter (even if silent)
- Ensuring the cleanup order is safe per ACL documentation
- Using aclrtFreeHost only if unregister succeeded
| case MemoryType::HOST: { | ||
| auto owner = MakeHostBuffer(size); | ||
| if (!owner) { return AllocationFailed("host"); } | ||
| region = {owner, owner.get(), owner.get(), TransProvider::MemType::MEM_HOST}; |
There was a problem hiding this comment.
[Suggestion] device_addr semantic inconsistency for HOST memory type
For MemoryType::HOST, deviceAddr is set equal to localAddr (owner.get()). This is semantically ambiguous because plain host memory does not have a device-visible mapping. The comment in buffer_manager.h states device_addr is for device-visible address used by HCOMM/RDMA, but setting it equal to addr for HOST type contradicts this.
Consider:
- Setting device_addr = 0 for HOST type (forcing callers to handle the case)
- Or explicitly documenting that addr == device_addr for HOST memory means the same physical address is used
| } | ||
|
|
||
| if (subBatchContext.flagBuffer.addr == 0 || subBatchContext.flagBuffer.length == 0) { | ||
| if (subBatchContext.sendSge.device_addr == 0 || subBatchContext.flagBuffer.addr == 0 || |
There was a problem hiding this comment.
[Critical] Missing validation for sendSge.length
The validation now checks sendSge.device_addr, flagBuffer.addr, flagBuffer.device_addr, and flagBuffer.length, but sendSge.length is not validated. If sendSge.length is 0 or invalid, the SendIoBatch will have an incorrect length parameter, potentially causing buffer overflow or underflow in RDMA operations.
Add: sendSge.length == 0 to the validation condition.
| const auto offset = idx * slot_size_; | ||
| sge.addr = reinterpret_cast<std::uint64_t>(static_cast<char*>(memory_.get()) + offset); | ||
| sge.device_addr = | ||
| reinterpret_cast<std::uint64_t>(static_cast<char*>(device_memory_) + offset); |
There was a problem hiding this comment.
[Warning] Potential null pointer arithmetic if device_memory_ is null
The device_addr calculation uses static_cast<char*>(device_memory_) + offset. If device_memory_ is nullptr (e.g., due to a failed Init that did not properly set it), this pointer arithmetic produces a garbage address value.
While Allocate() checks memory_ is valid at line 201, it does not verify device_memory_ is non-null. Consider adding: if (!device_memory_) return Status::Error(...)
| // Local address used by CPU code for SQE packing and completion polling. | ||
| std::uint64_t addr{0}; | ||
| // Device-visible address used by HCOMM/HIXL and remote RDMA operations. | ||
| std::uint64_t device_addr{0}; |
There was a problem hiding this comment.
[Performance] Struct size increase may impact cache efficiency
Adding device_addr increases ScatterGatherEntry from ~24 bytes to ~32 bytes (assuming 64-bit system). In high-throughput scenarios with many pending SQEs, this 33% size increase could:
- Increase memory footprint for sub-batch contexts
- Reduce cache efficiency when processing batches
- Impact memory bandwidth in tight loops
This is acceptable given the feature requirement, but worth noting for performance-sensitive deployments.
| request.dtype = GetTransportConfigAttr<std::uint8_t>(attrs, "dtype"); | ||
| request.dspec = GetTransportConfigAttr<std::uint8_t>(attrs, "dspec"); | ||
| request.response_buffer_addr = flagBuffer.addr; | ||
| request.response_buffer_addr = flagBuffer.device_addr; |
There was a problem hiding this comment.
[Suggestion] Architectural consistency for user-provided buffer addresses
The response_buffer_addr now correctly uses flagBuffer.device_addr for RDMA operations. However, the entry.buffer_addr at line 212 uses entries[index].buffer.region.addr (CPU address). If user-provided KVBuffer regions are also host-pinned memory, they should similarly use device addresses for RDMA operations.
This creates a semantic inconsistency: internal buffers use device_addr for RDMA, but user buffers use addr. Consider:
- Adding a device_addr field to MemoryRegion struct
- Or documenting that user-provided buffers must use device-visible addresses if they are host-pinned
- Or providing a registration API that captures both addresses
Purpose
Switch ASU send/flag buffers from plain host memory to host-pinned
memory so that CPU code packs SQEs through the local mapping while
HCOMM/RDMA uses the device-visible mapping of the same allocation.
Modifications
aclrtHostRegisterV2, obtain device pointer via
aclrtHostGetDevicePointer. ScatterGatherEntry gains device_addr.
RegisterMemory uses device address for host-pinned regions.
Test