[Bugfix] Environment pre-check code adaptation#1006
Conversation
ygwpz
left a comment
There was a problem hiding this comment.
Follow-up Review Summary
✅ Previous Issue Addressed
The encoding issue with the corrupted arrow character ("~R" instead of "→") has been fixed by replacing it with the word "to". Good fix!
🟠 New Issues Found
I found a few issues in the newly introduced bandwidth check implementation:
-
L3 - Global multiprocessing.Manager() at module level: The
dump_bw_listandload_bw_listare created at module level, causing unnecessary server process spawn and state accumulation across multiple runs. -
L4 - Resource leak: mmap objects in
make_array()are not explicitly closed. -
L4 - Exception swallowed: In
remote_local_hccn_cards_ping_test, exception is caught but not logged. -
Minor - Unused import: The
threadingmodule is imported but no longer used.
💡 Suggestions
- Consider making bandwidth test parameters configurable via
config.yaml - Add cleanup handling for mmap objects
See inline comments for specific code locations.
0f392fc to
18b9213
Compare
Purpose
Implement the Environment PreCheck Suite to perform comprehensive health checks for Ascend NPU clusters before model training/deployment, ensuring device availability, network connectivity, and storage performance stability.
Modifications
run_hccn_device_status_checkfunction: Implements HCCN device status detection (link neighbor, physical link, network health, IP configuration) and adds overall status aggregation logic.remote_local_hccn_cards_ping_testfunction: Supports local/remote HCCN card ping tests for NPU network connectivity verification with unified result return.Test
All tests have passed.