diff --git a/training/a4x/llama3-1-405b/nemo-pretraining-gke/16node-FP8CS-GBS2048/recipe/README.md b/training/a4x/llama3-1-405b/nemo-pretraining-gke/16node-FP8CS-GBS2048/recipe/README.md
index 02394166..cc7af968 100644
--- a/training/a4x/llama3-1-405b/nemo-pretraining-gke/16node-FP8CS-GBS2048/recipe/README.md
+++ b/training/a4x/llama3-1-405b/nemo-pretraining-gke/16node-FP8CS-GBS2048/recipe/README.md
@@ -112,19 +112,19 @@ your client:
 - To set the number of training steps to 100, run the following command from
   your client:
 
-  ```bash
-  cd $RECIPE_ROOT
-  export WORKLOAD_NAME=$USER-a4x-llama3-1-405b
-  helm install $WORKLOAD_NAME . -f values.yaml \
-  --set-file workload_launcher=launcher.sh \
-  --set-file workload_config=llama3-1-405b-fp8cs-gbs2048-gpus64.py \
-  --set workload.image=nvcr.io/nvidia/nemo:25.07 \
-  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
-  --set volumes.gcsMounts[0].mountPath=/job-logs \
-  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
-  --set queue=${KUEUE_NAME} \
-  --set workload.arguments[0]="trainer.max_steps=100"
-  ```
+```bash
+cd $RECIPE_ROOT
+export WORKLOAD_NAME=$USER-a4x-llama3-1-405b
+helm install $WORKLOAD_NAME . -f values.yaml \
+--set-file workload_launcher=launcher.sh \
+--set-file workload_config=llama3-1-405b-fp8cs-gbs2048-gpus64.py \
+--set workload.image=nvcr.io/nvidia/nemo:25.07 \
+--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+--set volumes.gcsMounts[0].mountPath=/job-logs \
+--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
+--set queue=${KUEUE_NAME} \
+--set workload.arguments[0]="trainer.max_steps=100"
+```
 
 ### Monitor the job
 
diff --git a/training/a4x/llama3-1-70b/nemo-pretraining-gke/16node-BF16-GBS2048/recipe/README.md b/training/a4x/llama3-1-70b/nemo-pretraining-gke/16node-BF16-GBS2048/recipe/README.md
index bcd42f57..b402f3f2 100644
--- a/training/a4x/llama3-1-70b/nemo-pretraining-gke/16node-BF16-GBS2048/recipe/README.md
+++ b/training/a4x/llama3-1-70b/nemo-pretraining-gke/16node-BF16-GBS2048/recipe/README.md
@@ -87,18 +87,18 @@ gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
 To execute the job with the default settings, run the following command from
 your client:
 
-  ```bash
-  cd $RECIPE_ROOT
-  export WORKLOAD_NAME=$USER-a4x-llama3-1-70b-16node
-  helm install $WORKLOAD_NAME . -f values.yaml \
-  --set-file workload_launcher=launcher.sh \
-  --set-file workload_config=llama3-1-70b-bf16-gbs2048-gpus64.py \
-  --set workload.image=nvcr.io/nvidia/nemo:25.07 \
-  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
-  --set volumes.gcsMounts[0].mountPath=/job-logs \
-  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
-  --set queue=${KUEUE_NAME}
-  ```
+```bash
+cd $RECIPE_ROOT
+export WORKLOAD_NAME=$USER-a4x-llama3-1-70b-16node
+helm install $WORKLOAD_NAME . -f values.yaml \
+--set-file workload_launcher=launcher.sh \
+--set-file workload_config=llama3-1-70b-bf16-gbs2048-gpus64.py \
+--set workload.image=nvcr.io/nvidia/nemo:25.07 \
+--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+--set volumes.gcsMounts[0].mountPath=/job-logs \
+--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
+--set queue=${KUEUE_NAME}
+```
 
 **Examples**
 
diff --git a/training/a4x/llama3-1-70b/nemo-pretraining-gke/16node-FP8CS-GBS2048/recipe/README.md b/training/a4x/llama3-1-70b/nemo-pretraining-gke/16node-FP8CS-GBS2048/recipe/README.md
index c2c58795..b53d45b7 100644
--- a/training/a4x/llama3-1-70b/nemo-pretraining-gke/16node-FP8CS-GBS2048/recipe/README.md
+++ b/training/a4x/llama3-1-70b/nemo-pretraining-gke/16node-FP8CS-GBS2048/recipe/README.md
@@ -87,18 +87,18 @@ gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
 To execute the job with the default settings, run the following command from
 your client:
 
-  ```bash
-  cd $RECIPE_ROOT
-  export WORKLOAD_NAME=a4x-llama3-1-70b-fp8cs-gbs2048-gpus64
-  helm install $WORKLOAD_NAME . -f values.yaml \
-  --set-file workload_launcher=launcher.sh \
-  --set-file workload_config=llama3-1-70b-fp8cs-gbs2048-gpus64.py \
-  --set workload.image=nvcr.io/nvidia/nemo:25.07 \
-  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
-  --set volumes.gcsMounts[0].mountPath=/job-logs \
-  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
-  --set queue=${KUEUE_NAME}
-  ```
+```bash
+cd $RECIPE_ROOT
+export WORKLOAD_NAME=a4x-llama3-1-70b-fp8cs-gbs2048-gpus64
+helm install $WORKLOAD_NAME . -f values.yaml \
+--set-file workload_launcher=launcher.sh \
+--set-file workload_config=llama3-1-70b-fp8cs-gbs2048-gpus64.py \
+--set workload.image=nvcr.io/nvidia/nemo:25.07 \
+--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+--set volumes.gcsMounts[0].mountPath=/job-logs \
+--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
+--set queue=${KUEUE_NAME}
+```
 
 **Examples**
 
diff --git a/training/a4x/llama3-1-70b/nemo-pretraining-gke/32node-FP8CS-GBS2048/recipe/README.md b/training/a4x/llama3-1-70b/nemo-pretraining-gke/32node-FP8CS-GBS2048/recipe/README.md
index bede2a65..ee39cd62 100644
--- a/training/a4x/llama3-1-70b/nemo-pretraining-gke/32node-FP8CS-GBS2048/recipe/README.md
+++ b/training/a4x/llama3-1-70b/nemo-pretraining-gke/32node-FP8CS-GBS2048/recipe/README.md
@@ -87,18 +87,18 @@ gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
 To execute the job with the default settings, run the following command from
 your client:
 
-  ```bash
-  cd $RECIPE_ROOT
-  export WORKLOAD_NAME=$USER-a4x-llama3-1-70b-32node
-  helm install $WORKLOAD_NAME . -f values.yaml \
-  --set-file workload_launcher=launcher.sh \
-  --set-file workload_config=llama3-1-70b-fp8cs-gbs2048-gpus128.py \
-  --set workload.image=nvcr.io/nvidia/nemo:25.07 \
-  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
-  --set volumes.gcsMounts[0].mountPath=/job-logs \
-  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
-  --set queue=${KUEUE_NAME}
-  ```
+```bash
+cd $RECIPE_ROOT
+export WORKLOAD_NAME=$USER-a4x-llama3-1-70b-32node
+helm install $WORKLOAD_NAME . -f values.yaml \
+--set-file workload_launcher=launcher.sh \
+--set-file workload_config=llama3-1-70b-fp8cs-gbs2048-gpus128.py \
+--set workload.image=nvcr.io/nvidia/nemo:25.07 \
+--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+--set volumes.gcsMounts[0].mountPath=/job-logs \
+--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
+--set queue=${KUEUE_NAME}
+```
 
 **Examples**
 
diff --git a/training/a4x/llama3-1-70b/nemo-pretraining-gke/64node-FP8CS-GBS2048/recipe/README.md b/training/a4x/llama3-1-70b/nemo-pretraining-gke/64node-FP8CS-GBS2048/recipe/README.md
index 0ca3c26d..ec95a07d 100644
--- a/training/a4x/llama3-1-70b/nemo-pretraining-gke/64node-FP8CS-GBS2048/recipe/README.md
+++ b/training/a4x/llama3-1-70b/nemo-pretraining-gke/64node-FP8CS-GBS2048/recipe/README.md
@@ -87,18 +87,18 @@ gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
 To execute the job with the default settings, run the following command from
 your client:
 
-  ```bash
-  cd $RECIPE_ROOT
-  export WORKLOAD_NAME=$USER-a4x-llama3-1-70b-64node
-  helm install $WORKLOAD_NAME . -f values.yaml \
-  --set-file workload_launcher=launcher.sh \
-  --set-file workload_config=llama3-1-70b-fp8cs-gbs2048-gpus256.py \
-  --set workload.image=nvcr.io/nvidia/nemo:25.07 \
-  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
-  --set volumes.gcsMounts[0].mountPath=/job-logs \
-  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
-  --set queue=${KUEUE_NAME}
-  ```
+```bash
+cd $RECIPE_ROOT
+export WORKLOAD_NAME=$USER-a4x-llama3-1-70b-64node
+helm install $WORKLOAD_NAME . -f values.yaml \
+--set-file workload_launcher=launcher.sh \
+--set-file workload_config=llama3-1-70b-fp8cs-gbs2048-gpus256.py \
+--set workload.image=nvcr.io/nvidia/nemo:25.07 \
+--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+--set volumes.gcsMounts[0].mountPath=/job-logs \
+--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
+--set queue=${KUEUE_NAME}
+```
 
 **Examples**
 
diff --git a/training/a4x/llama3-1-8b/nemo-pretraining-gke/16node-BF16-GBS1024/recipe/README.md b/training/a4x/llama3-1-8b/nemo-pretraining-gke/16node-BF16-GBS1024/recipe/README.md
index 5ef4f3c8..ed347fbc 100644
--- a/training/a4x/llama3-1-8b/nemo-pretraining-gke/16node-BF16-GBS1024/recipe/README.md
+++ b/training/a4x/llama3-1-8b/nemo-pretraining-gke/16node-BF16-GBS1024/recipe/README.md
@@ -87,18 +87,18 @@ gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
 To execute the job with the default settings, run the following command from
 your client:
 
-  ```bash
-  cd $RECIPE_ROOT
-  export WORKLOAD_NAME=$USER-a4x-llama3-1-8b-16node
-  helm install $WORKLOAD_NAME . -f values.yaml \
-  --set-file workload_launcher=launcher.sh \
-  --set-file workload_config=llama3-1-8b-bf16-gbs1024-gpus64.py \
-  --set workload.image=nvcr.io/nvidia/nemo:25.07 \
-  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
-  --set volumes.gcsMounts[0].mountPath=/job-logs \
-  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
-  --set queue=${KUEUE_NAME}
-  ```
+```bash
+cd $RECIPE_ROOT
+export WORKLOAD_NAME=$USER-a4x-llama3-1-8b-16node
+helm install $WORKLOAD_NAME . -f values.yaml \
+--set-file workload_launcher=launcher.sh \
+--set-file workload_config=llama3-1-8b-bf16-gbs1024-gpus64.py \
+--set workload.image=nvcr.io/nvidia/nemo:25.07 \
+--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+--set volumes.gcsMounts[0].mountPath=/job-logs \
+--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
+--set queue=${KUEUE_NAME}
+```
 
 **Examples**
 
diff --git a/training/a4x/llama3-1-8b/nemo-pretraining-gke/16node-FP8CS-GBS128/recipe/README.md b/training/a4x/llama3-1-8b/nemo-pretraining-gke/16node-FP8CS-GBS128/recipe/README.md
index cd68cdcd..2c182e86 100644
--- a/training/a4x/llama3-1-8b/nemo-pretraining-gke/16node-FP8CS-GBS128/recipe/README.md
+++ b/training/a4x/llama3-1-8b/nemo-pretraining-gke/16node-FP8CS-GBS128/recipe/README.md
@@ -86,7 +86,7 @@ gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
 To execute the job with the default settings, run the following command from
 your client:
 
-  bash
+  ```bash
   cd $RECIPE_ROOT
   export WORKLOAD_NAME=a4x-llama3-1-8b-fp8cs-gbs128-gpus64-16node
   helm install $WORKLOAD_NAME . -f values.yaml \
@@ -97,6 +97,7 @@ your client:
   --set volumes.gcsMounts[0].mountPath=/job-logs \
   --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
   --set queue=${KUEUE_NAME}
+  ```
 
 **Examples**
 