- To set the number of training steps to 100, run the following command from
your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4x-llama3-1-405b
helm install $WORKLOAD_NAME . -f values.yaml \
--set-file workload_launcher=launcher.sh \
--set-file workload_config=llama3-1-405b-fp8cs-gbs2048-gpus64.py \
--set workload.image=nvcr.io/nvidia/nemo:25.07 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set volumes.gcsMounts[0].mountPath=/job-logs \
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
--set queue=${KUEUE_NAME} \
--set workload.arguments[0]="trainer.max_steps=100"
```
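
Additional overrides can be passed the same way by appending entries to `workload.arguments` with increasing indices. The following is only a sketch of the pattern, assuming the chart forwards every entry in that list to the training config; the second key, `trainer.val_check_interval`, is a hypothetical illustration and should be checked against the workload config before use:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4x-llama3-1-405b
# Same command as above, with a second (hypothetical) override appended
helm install $WORKLOAD_NAME . -f values.yaml \
--set-file workload_launcher=launcher.sh \
--set-file workload_config=llama3-1-405b-fp8cs-gbs2048-gpus64.py \
--set workload.image=nvcr.io/nvidia/nemo:25.07 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set volumes.gcsMounts[0].mountPath=/job-logs \
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
--set queue=${KUEUE_NAME} \
--set workload.arguments[0]="trainer.max_steps=100" \
--set workload.arguments[1]="trainer.val_check_interval=50"
```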

### Monitor the job

To execute the job with the default settings, run the following command from
your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4x-llama3-1-70b-16node
helm install $WORKLOAD_NAME . -f values.yaml \
--set-file workload_launcher=launcher.sh \
--set-file workload_config=llama3-1-70b-bf16-gbs2048-gpus64.py \
--set workload.image=nvcr.io/nvidia/nemo:25.07 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set volumes.gcsMounts[0].mountPath=/job-logs \
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
--set queue=${KUEUE_NAME}
```
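
Once the release is installed, a quick sanity check can be done with standard Helm and kubectl commands. This is only a sketch; the `grep` filter assumes the pods created by the chart carry the release name in their names, which may not match your chart exactly:

```bash
# Confirm the Helm release was created and inspect its status
helm status $WORKLOAD_NAME

# List pods belonging to the workload (name-based filter is an assumption)
kubectl get pods | grep $WORKLOAD_NAME
```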

**Examples**

To execute the job with the default settings, run the following command from
your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=a4x-llama3-1-70b-fp8cs-gbs2048-gpus64
helm install $WORKLOAD_NAME . -f values.yaml \
--set-file workload_launcher=launcher.sh \
--set-file workload_config=llama3-1-70b-fp8cs-gbs2048-gpus64.py \
--set workload.image=nvcr.io/nvidia/nemo:25.07 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set volumes.gcsMounts[0].mountPath=/job-logs \
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
--set queue=${KUEUE_NAME}
```

**Examples**

To execute the job with the default settings, run the following command from
your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4x-llama3-1-70b-32node
helm install $WORKLOAD_NAME . -f values.yaml \
--set-file workload_launcher=launcher.sh \
--set-file workload_config=llama3-1-70b-fp8cs-gbs2048-gpus128.py \
--set workload.image=nvcr.io/nvidia/nemo:25.07 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set volumes.gcsMounts[0].mountPath=/job-logs \
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
--set queue=${KUEUE_NAME}
```

**Examples**

To execute the job with the default settings, run the following command from
your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4x-llama3-1-70b-64node
helm install $WORKLOAD_NAME . -f values.yaml \
--set-file workload_launcher=launcher.sh \
--set-file workload_config=llama3-1-70b-fp8cs-gbs2048-gpus256.py \
--set workload.image=nvcr.io/nvidia/nemo:25.07 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set volumes.gcsMounts[0].mountPath=/job-logs \
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
--set queue=${KUEUE_NAME}
```

**Examples**

To execute the job with the default settings, run the following command from
your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4x-llama3-1-8b-16node
helm install $WORKLOAD_NAME . -f values.yaml \
--set-file workload_launcher=launcher.sh \
--set-file workload_config=llama3-1-8b-bf16-gbs1024-gpus64.py \
--set workload.image=nvcr.io/nvidia/nemo:25.07 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set volumes.gcsMounts[0].mountPath=/job-logs \
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
--set queue=${KUEUE_NAME}
```

**Examples**

To execute the job with the default settings, run the following command from
your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=a4x-llama3-1-8b-fp8cs-gbs128-gpus64-16node
helm install $WORKLOAD_NAME . -f values.yaml \
--set-file workload_launcher=launcher.sh \
--set-file workload_config=llama3-1-8b-fp8cs-gbs128-gpus64.py \
--set workload.image=nvcr.io/nvidia/nemo:25.07 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set volumes.gcsMounts[0].mountPath=/job-logs \
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
--set queue=${KUEUE_NAME}
```

**Examples**

To execute the job with the default settings, run the following command from
your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4x-nemotron4-340b-32node
helm install $WORKLOAD_NAME . -f values.yaml \
--set-file workload_launcher=launcher.sh \
--set-file workload_config=nemotron4-340b-fp8cs-gbs256-gpus128.py \
--set workload.image=nvcr.io/nvidia/nemo:25.09 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set volumes.gcsMounts[0].mountPath=/job-logs \
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
--set queue=${KUEUE_NAME}
```
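
When a run has finished, or you want to resubmit with different settings, the release can be removed with standard Helm. The job logs already written under the mounted GCS bucket are Cloud Storage objects and are not deleted by removing the Kubernetes resources:

```bash
# Remove the workload's Helm release; logs in the GCS bucket remain in place
helm uninstall $WORKLOAD_NAME
```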

**Examples**
