Skip to content

Simplify RCCL in CVS with strict config and canonical run artifacts#115

Open
speriaswamy-amd wants to merge 5 commits intomainfrom
surya/rccl-simplification
Open

Simplify RCCL in CVS with strict config and canonical run artifacts#115
speriaswamy-amd wants to merge 5 commits intomainfrom
surya/rccl-simplification

Conversation

@speriaswamy-amd
Copy link
Copy Markdown
Contributor

  • Collapse RCCL onto one rccl_cvs runner path and remove legacy/ambiguous config handling.
  • Introduce a strict nested config model validated with Pydantic, with topology inferred from num_ranks and ranks_per_node.
  • Standardize outputs on one canonical per-run run.json, with optional raw per-case JSON only when requested.
  • Tighten tests and docs around the supported contract, and clean up RCCL naming for better maintainability.
  • Verified with focused unit tests plus real single-node and multi-node cluster runs.

export ANP_HOME_DIR="<changeme>/amd-anp"

export IONIC_LOCKFREE=all
export NCCL_IB_TC=0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Traffic Class settings in line line 12 and 13 has to be changed/tuned as per customer network fabric config, so can we add a comment

export NCCL_IB_TC=0 #change it as per the network settings
export NCCL_IB_FIFO_TC=0 #change it as per the network settings

Comment on lines +19 to +21
export NCCL_DMABUF_ENABLE=0
export NCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING=0
export NCCL_NET_PLUGIN=librccl-anp.so
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this env is the ainic is built in to RCCL?
Can we keep it commented

export NCCL_NET_PLUGIN=librccl-anp.so #Comment if RCCL version > XYZ used


export PATH="${MPI_HOME}/bin:${ROCM_HOME}/bin:${PATH}"
export LD_LIBRARY_PATH="${RCCL_HOME}/lib:${ANP_HOME_DIR}:${MPI_HOME}/lib:${ROCM_HOME}/lib:${LD_LIBRARY_PATH:-}"
export LD_PRELOAD="${ANP_HOME_DIR}/build/librccl-anp.so:${RCCL_HOME}/lib/librccl.so"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this env is the ainic is built in to RCCL?
Can we keep it commented

export LD_PRELOAD="${ANP_HOME_DIR}/build/librccl-anp.so:${RCCL_HOME}/lib/librccl.so" #Comment if RCCL version > XYZ used

or is it not harmful even in the RCCL build with ainic built into it?

@@ -0,0 +1,25 @@
#!/usr/bin/env bash
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to keep the env scripts under below dir instead of config_file, as we discussed with venky
cvs/input/env_scripts/rccl/aini_env_script.sh

@@ -0,0 +1,25 @@
#!/usr/bin/env bash
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont see mpi env variables(sample/default) set in the env file.

Comment thread cvs/lib/rccl/artifacts.py
return dt.strftime("%Y-%m-%dT%H-%M-%SZ")


def _no_matrix_case_id(collective_index: int, collective: str) -> str:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_no_matrix_case_id() - Name doesn't explain purpose

Better: _build_case_id() or _format_case_identifier()

Comment thread cvs/lib/rccl/artifacts.py
return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def _ensure_unique_case_id(base: str, used: set[str], resolved: dict[str, Any]) -> str:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_resolved_case_payload() - "Resolved" and "payload" are unclear

Better: _build_test_case_config() or _create_case_metadata()

cmd_output[host] = cmd_output.get(host, "") + "\nABORT: Host Unreachable Error"

def _process_output(self, output, cmd=None, cmd_list=None, print_console=True):
def _process_output(self, output, cmd=None, cmd_list=None, print_console=True, print_console_prefix=True):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in stead of adding more and more print_* args, can we come up with something as below

`class OutputMode(Enum):
SILENT = "silent" # No console output
COMPACT = "compact" # Output only, no prefixes
VERBOSE = "verbose" # Full output with prefixes

def _process_output(self, output, cmd=None, cmd_list=None, output_mode=OutputMode.VERBOSE):`

Comment thread cvs/lib/rccl/artifacts.py
return re.sub(r"[^A-Za-z0-9._-]", "_", value)


def _format_run_id_utc(now: datetime | None = None) -> str:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we simplify as below?

def _format_run_id_utc(now: datetime | None = None) -> str: dt = now or datetime.now(timezone.utc) return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")

@@ -0,0 +1,46 @@
"""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the old tests rccl_*_cvs.py will embed the various reports to the final test, report, introduced as part of below commit, which is missing in the refactored test script

d087b7a

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants