Simplify RCCL in CVS with strict config and canonical run artifacts#115
Simplify RCCL in CVS with strict config and canonical run artifacts#115speriaswamy-amd wants to merge 5 commits intomainfrom
Conversation
speriaswamy-amd
commented
Apr 3, 2026
- Collapse RCCL onto one rccl_cvs runner path and remove legacy/ambiguous config handling.
- Introduce a strict nested config model validated with Pydantic, with topology inferred from num_ranks and ranks_per_node.
- Standardize outputs on one canonical per-run run.json, with optional raw per-case JSON only when requested.
- Tighten tests and docs around the supported contract, and clean up RCCL naming for better maintainability.
- Verified with focused unit tests plus real single-node and multi-node cluster runs.
| export ANP_HOME_DIR="<changeme>/amd-anp" | ||
|
|
||
| export IONIC_LOCKFREE=all | ||
| export NCCL_IB_TC=0 |
There was a problem hiding this comment.
Traffic Class settings in line line 12 and 13 has to be changed/tuned as per customer network fabric config, so can we add a comment
export NCCL_IB_TC=0 #change it as per the network settings
export NCCL_IB_FIFO_TC=0 #change it as per the network settings
| export NCCL_DMABUF_ENABLE=0 | ||
| export NCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING=0 | ||
| export NCCL_NET_PLUGIN=librccl-anp.so |
There was a problem hiding this comment.
Do we need this env is the ainic is built in to RCCL?
Can we keep it commented
export NCCL_NET_PLUGIN=librccl-anp.so #Comment if RCCL version > XYZ used
|
|
||
| export PATH="${MPI_HOME}/bin:${ROCM_HOME}/bin:${PATH}" | ||
| export LD_LIBRARY_PATH="${RCCL_HOME}/lib:${ANP_HOME_DIR}:${MPI_HOME}/lib:${ROCM_HOME}/lib:${LD_LIBRARY_PATH:-}" | ||
| export LD_PRELOAD="${ANP_HOME_DIR}/build/librccl-anp.so:${RCCL_HOME}/lib/librccl.so" |
There was a problem hiding this comment.
Do we need this env is the ainic is built in to RCCL?
Can we keep it commented
export LD_PRELOAD="${ANP_HOME_DIR}/build/librccl-anp.so:${RCCL_HOME}/lib/librccl.so" #Comment if RCCL version > XYZ used
or is it not harmful even in the RCCL build with ainic built into it?
| @@ -0,0 +1,25 @@ | |||
| #!/usr/bin/env bash | |||
There was a problem hiding this comment.
We need to keep the env scripts under below dir instead of config_file, as we discussed with venky
cvs/input/env_scripts/rccl/aini_env_script.sh
| @@ -0,0 +1,25 @@ | |||
| #!/usr/bin/env bash | |||
There was a problem hiding this comment.
I dont see mpi env variables(sample/default) set in the env file.
| return dt.strftime("%Y-%m-%dT%H-%M-%SZ") | ||
|
|
||
|
|
||
| def _no_matrix_case_id(collective_index: int, collective: str) -> str: |
There was a problem hiding this comment.
_no_matrix_case_id() - Name doesn't explain purpose
Better: _build_case_id() or _format_case_identifier()
| return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8") | ||
|
|
||
|
|
||
| def _ensure_unique_case_id(base: str, used: set[str], resolved: dict[str, Any]) -> str: |
There was a problem hiding this comment.
_resolved_case_payload() - "Resolved" and "payload" are unclear
Better: _build_test_case_config() or _create_case_metadata()
| cmd_output[host] = cmd_output.get(host, "") + "\nABORT: Host Unreachable Error" | ||
|
|
||
| def _process_output(self, output, cmd=None, cmd_list=None, print_console=True): | ||
| def _process_output(self, output, cmd=None, cmd_list=None, print_console=True, print_console_prefix=True): |
There was a problem hiding this comment.
in stead of adding more and more print_* args, can we come up with something as below
`class OutputMode(Enum):
SILENT = "silent" # No console output
COMPACT = "compact" # Output only, no prefixes
VERBOSE = "verbose" # Full output with prefixes
def _process_output(self, output, cmd=None, cmd_list=None, output_mode=OutputMode.VERBOSE):`
| return re.sub(r"[^A-Za-z0-9._-]", "_", value) | ||
|
|
||
|
|
||
| def _format_run_id_utc(now: datetime | None = None) -> str: |
There was a problem hiding this comment.
can we simplify as below?
def _format_run_id_utc(now: datetime | None = None) -> str: dt = now or datetime.now(timezone.utc) return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
| @@ -0,0 +1,46 @@ | |||
| """ | |||
There was a problem hiding this comment.
All the old tests rccl_*_cvs.py will embed the various reports to the final test, report, introduced as part of below commit, which is missing in the refactored test script