Skip to content

[Bugfix] Environment pre-check code adaptation#1006

Open
Menglths wants to merge 1 commit into
ModelEngine-Group:developfrom
Menglths:EnvPreCheck_2
Open

[Bugfix] Environment pre-check code adaptation#1006
Menglths wants to merge 1 commit into
ModelEngine-Group:developfrom
Menglths:EnvPreCheck_2

Conversation

@Menglths

@Menglths Menglths commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Purpose

Implement the Environment PreCheck Suite to perform comprehensive health checks for Ascend NPU clusters before model training/deployment, ensuring device availability, network connectivity, and storage performance stability.

Modifications

  1. Added run_hccn_device_status_check function: Implements HCCN device status detection (link neighbor, physical link, network health, IP configuration) and adds overall status aggregation logic.
  2. Added remote_local_hccn_cards_ping_test function: Supports local/remote HCCN card ping tests for NPU network connectivity verification with unified result return.
  3. Added complete storage bandwidth check module: Implements multi-process concurrent dump/load performance testing, calculates average bandwidth, and outputs standardized performance reports.

Test

All tests have passed.

@Menglths Menglths requested review from Wwwzff, mag1c-h and ygwpz as code owners June 8, 2026 13:01
Comment thread test/common/envPreCheck/run_env_preCheck.py Outdated
Comment thread test/common/envPreCheck/run_env_preCheck.py Outdated
Comment thread test/common/envPreCheck/run_env_preCheck.py Outdated
Comment thread test/common/envPreCheck/run_env_preCheck.py
Comment thread test/common/envPreCheck/run_env_preCheck.py

@ygwpz ygwpz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up Review Summary

✅ Previous Issue Addressed

The encoding issue with the corrupted arrow character ("὆~R" instead of "→") has been fixed by replacing it with the word "to". Good fix!

🟠 New Issues Found

I found a few issues in the newly introduced bandwidth check implementation:

  1. L3 - Global multiprocessing.Manager() at module level: The dump_bw_list and load_bw_list are created at module level, causing unnecessary server process spawn and state accumulation across multiple runs.

  2. L4 - Resource leak: mmap objects in make_array() are not explicitly closed.

  3. L4 - Exception swallowed: In remote_local_hccn_cards_ping_test, exception is caught but not logged.

  4. Minor - Unused import: The threading module is imported but no longer used.

💡 Suggestions

  • Consider making bandwidth test parameters configurable via config.yaml
  • Add cleanup handling for mmap objects

See inline comments for specific code locations.

@Menglths Menglths force-pushed the EnvPreCheck_2 branch 3 times, most recently from 0f392fc to 18b9213 Compare June 15, 2026 06:28
Comment thread test/common/envPreCheck/run_env_preCheck.py Outdated
Comment thread test/common/envPreCheck/run_env_preCheck.py Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants