Commit 504139c
Update scripts for Dawn to match Baskerville training
Updates the scripts and instructions for Dawn. The main purposes of these changes are:

1. To ensure the training aligns with that done on Baskerville.
2. To ensure enough data is downloaded.
3. To provide scripts for all of the steps for improved reproducibility.
4. To update the instructions to make the steps clear.
1 parent 17d6070 commit 504139c

File tree

13 files changed: +699 −61 lines


dawn/batch/dawn-download-era5.sh

Lines changed: 76 additions & 0 deletions (new file)

```sh
#!/bin/bash -l
#SBATCH --job-name download-era5
#SBATCH --output results/download-era5-%A.out
#SBATCH --account airr-p8-rcpp-dawn-gpu
#SBATCH --partition pvc9          # Dawn PVC partition
#SBATCH --cpus-per-task 24        # Number of cores per task
#SBATCH --nodes 1                 # Number of nodes
#SBATCH --gpus-per-node 1         # Number of requested GPUs per node
#SBATCH --ntasks-per-node 1       # MPI ranks per node
#SBATCH --time 02:00:00

# Execute using:
# sbatch ./dawn-download-era5.sh

echo
echo "## Aurora download ERA5 data script starting"

# Quit on error
set -e

pushd ../scripts

echo
echo "## Loading modules"

module purge
module load default-dawn
module load lua
module load intel-oneapi-ccl/2021.14.0
module load intel-oneapi-mpi/2021.14.1
module load intel-oneapi-mkl/2025.0.1

echo
echo "## Configuring environment"

VENV_DIR=../../dawn/environments/venv_3_11_11

echo
echo "## Initialising virtual environment"

source ${VENV_DIR}/bin/activate

echo
echo "## Details"
echo
echo "Nodes: ${SLURM_JOB_NUM_NODES}"
echo "GPUs per node: ${SLURM_GPUS_PER_NODE}"
echo "Tasks per node: ${SLURM_NTASKS_PER_NODE}"
echo "CPUs per task: ${SLURM_CPUS_PER_TASK}"
echo "Working directory: $(realpath ${PWD})"
echo "Location of venv: $(realpath $VENV_DIR)"

echo
echo "## Downloading data"

START=$(date +%s)
python era_v_download.py
END=$(date +%s)
ELAPSED=$((${END}-${START}))

echo
echo "## Details post"
echo
echo "Time completed: $(date --iso-8601=ns)"
echo "Epoch start: ${START}"
echo "Epoch end: ${END}"
echo "Elapsed: ${ELAPSED} seconds"

echo
echo "## Tidying up"

deactivate
popd

echo
echo "## Aurora download ERA5 data script completed"
```
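The `START`/`END` bracket above is a plain epoch-seconds timer. A standalone sketch of the idiom, with `sleep 1` standing in for the real workload:

```shell
# Epoch-seconds timing bracket, as used around the download and install steps.
START=$(date +%s)
sleep 1   # stand-in for the real workload (python era_v_download.py)
END=$(date +%s)
ELAPSED=$((END - START))
echo "Elapsed: ${ELAPSED} seconds"
```

Second-level granularity is coarse but sufficient for jobs measured in minutes or hours.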

dawn/scripts/era_v_download.py

Lines changed: 4 additions & 4 deletions

```diff
@@ -42,7 +42,7 @@
 print("Static variables downloaded!")

 # Download the surface-level variables.
-if not (download_path / "2023-01-surface-level.nc").exists():
+if not (download_path / "2023-01-surface-level-36.nc").exists():
     c.retrieve(
         "reanalysis-era5-single-levels",
         {
@@ -69,12 +69,12 @@
             "time": ["00:00", "06:00", "12:00", "18:00"],
             "format": "netcdf",
         },
-        str(download_path / "2023-01-surface-level.nc"),
+        str(download_path / "2023-01-surface-level-36.nc"),
     )
     print("Surface-level variables downloaded!")

 # Download the atmospheric variables.
-if not (download_path / "2023-01-atmospheric.nc").exists():
+if not (download_path / "2023-01-atmospheric-36.nc").exists():
     c.retrieve(
         "reanalysis-era5-pressure-levels",
         {
@@ -117,6 +117,6 @@
             "time": ["00:00", "06:00", "12:00", "18:00"],
             "format": "netcdf",
         },
-        str(download_path / "2023-01-atmospheric.nc"),
+        str(download_path / "2023-01-atmospheric-36.nc"),
     )
     print("Atmospheric variables downloaded!")
```
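The diff renames the target files (adding a `-36` suffix) but keeps the same exists-check guard around each retrieval. A minimal sketch of that guard, with the `c.retrieve(...)` call replaced by a `touch` so it runs without CDS credentials (the directory and filename here are illustrative):

```python
import tempfile
from pathlib import Path

# Guard pattern from era_v_download.py: only fetch the target file when it
# does not already exist, so re-running the script skips completed downloads.
download_path = Path(tempfile.mkdtemp())

target = download_path / "2023-01-surface-level-36.nc"
if not target.exists():
    # Real script: c.retrieve("reanalysis-era5-single-levels", {...}, str(target))
    target.touch()
    status = "downloaded"
else:
    status = "cached"

print(status)
```

On a second run against the same directory the guard would take the `cached` branch, which is what makes the sbatch job safely re-runnable.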

src/aurora_hpc/dataset.py

Lines changed: 3 additions & 0 deletions

```diff
@@ -36,6 +36,7 @@ def __init__(
         surface_data: str | Path | xr.Dataset = Path("2023-01-01-surface-level.nc"),
         atmos_data: str | Path | xr.Dataset = Path("2023-01-01-atmospheric.nc"),
         use_dask: bool = False,
+        len_max: int | None = None,
     ):
         self.t = t

@@ -80,6 +81,8 @@ def __init__(
         self.length = (
             len(torch.from_numpy(self.surf_vars_ds["t2m"].values)) - self.t - 1
         )
+        if len_max:
+            self.length = min(self.length, len_max)

    def _get_batch(self, timerange):
        """Returns a batch covering a time range.
```

train/README.md

Lines changed: 75 additions & 4 deletions

````diff
@@ -3,15 +3,17 @@
 The code in this folder is for performing various training related experiments.
 See below for instructions for how to run them.

-## Running within an interactive session
+## Baskerville
+
+### Running within an interactive session

 To run the interactive session scripts, first ensure you're on a compute node by running the following or an equivalent `srun` command (you'll need to update the QoS and account details):

 ```sh
 srun --qos turing --account usjs9456-ati-test --time 1:00:00 --nodes 1 --gpus 1 --cpus-per-gpu 36 --mem 16384 --pty /bin/bash
 ```

-## Queued jobs using sbatch
+### Queued jobs using sbatch

 All sbatch scripts have a QoS and account details set in them.
 The parameters used for these will depend on your account and so should be adjusted accordingly.
@@ -21,7 +23,7 @@ The parameters used for these will depend on your account and so should be adjus
 #SBATCH --account usjs9456-ati-test
 ```

-## Baskerville training using FSDP
+### Training using FSDP

 The case of a single node can be run within an srun interactive session or scheduled using the sbatch scripts.

@@ -63,7 +65,7 @@ sbatch bask-train-fsdp-4x4.sh
 This is set up to run on 2 nodes with 2 GPUs and to perform just a single run.
 Edit the script header to test other combinations.

-## Baskerville bandwidth
+### Bandwidth

 All bandwidth experiments should be run within an interactive session and from within the `aurora-hpc/train/batch` directory.

@@ -78,3 +80,72 @@ To run the GPU bandwidth experiments:
 ```sh
 ./bask-srun-gpubw.sh
 ```
+
+## Dawn
+
+The Dawn scripts are all run as batch jobs; there are no interactive session scripts.
+
+All scripts should be run directly from the `aurora-hpc/train/batch` directory.
+
+### Creating the virtual environment
+
+Before performing any training, the virtual environment should be created by running the following script:
+
+```sh
+cd aurora-hpc/train/batch
+sbatch dawn-create-venv.sh
+```
+
+This will create a virtual environment in the `aurora-hpc/dawn/environments/venv_3_11_11` directory.
+
+### Download the data
+
+The data must also be downloaded before training can commence.
+This also requires that you've created an account with the Climate Data Store and created a `.cdsapirc` file in your home directory with the following contents:
+
+```sh
+url: https://cds.climate.copernicus.eu/api
+key: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx
+```
+
+Here the `x` values must be replaced by your access key.
+You can find out more about how to do this on the [Aurora ERA5 page](https://microsoft.github.io/aurora/example_era5.html).
+
+Once you've set up your access key you can then download the data directly to Dawn using the following sbatch script:
+
+```sh
+cd aurora-hpc/dawn/batch
+sbatch dawn-download-era5.sh
+```
+
+If successful, this will result in the following files being downloaded to the `aurora-hpc/dawn/era5/era_v_inf` directory:
+
+```
+2023-01-atmospheric-36.nc
+2023-01-surface-level-36.nc
+static.nc
+```
+
+### Training
+
+Once the virtual environment is set up, the training can be executed by queueing the appropriate script for the number of nodes and GPUs you want to use.
+
+The following example is for one node and one GPU:
+
+```sh
+cd aurora-hpc/train/batch
+sbatch dawn-train-ddp-1x1.sh
+```
+
+The other available configurations are the following:
+
+```sh
+dawn-train-ddp-1x1.sh # One node with one GPU (one GPU total)
+dawn-train-ddp-1x4.sh # One node with four GPUs (four GPUs total)
+dawn-train-ddp-2x4.sh # Two nodes with two GPUs each (four GPUs total)
+dawn-train-ddp-2x8.sh # Two nodes with four GPUs each (eight GPUs total)
+dawn-train-ddp-4x4.sh # Four nodes with one GPU each (four GPUs total)
+dawn-train-ddp-4x8.sh # Four nodes with two GPUs each (eight GPUs total)
+```
+
+After each run the output logs will be sent to the `aurora-hpc/train/batch/results` directory.
````
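To exercise every configuration in turn, the submissions can be scripted. A dry-run sketch that only prints the commands (on Dawn, replace the `echo` with a bare `sbatch`; the `NxM` suffix is nodes x total GPUs):

```shell
# Print the submission command for each Dawn DDP configuration listed above.
for cfg in 1x1 1x4 2x4 2x8 4x4 4x8; do
  echo "sbatch dawn-train-ddp-${cfg}.sh"
done
```

Submitting them all at once is fine since Slurm queues the jobs independently; the results directory keeps one log per job via the `%A` job-ID pattern in the output filenames.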

train/batch/dawn-create-venv.sh

Lines changed: 80 additions & 0 deletions (new file)

```sh
#!/bin/bash -l
#SBATCH --job-name venv
#SBATCH --output results/create-venv-%A.out
#SBATCH --account airr-p8-rcpp-dawn-gpu
#SBATCH --partition pvc9          # Dawn PVC partition
#SBATCH --cpus-per-task 24        # Number of cores per task
#SBATCH --nodes 1                 # Number of nodes
#SBATCH --gpus-per-node 1         # Number of requested GPUs per node
#SBATCH --ntasks-per-node 1       # MPI ranks per node
#SBATCH --time 01:00:00

# Execute using:
# sbatch ./dawn-create-venv.sh

echo
echo "## Aurora create virtual environment script starting"

# Quit on error
set -e

pushd ../scripts

echo
echo "## Loading modules"

module purge
module load default-dawn
module load lua
module load intel-oneapi-ccl/2021.14.0
module load intel-oneapi-mpi/2021.14.1
module load intel-oneapi-mkl/2025.0.1

echo
echo "## Configuring environment"

VENV_DIR=../../dawn/environments/venv_3_11_11

echo
echo "## Initialising virtual environment"

python3.11 -m venv $VENV_DIR
. ${VENV_DIR}/bin/activate

echo
echo "## Details"
echo
echo "Nodes: ${SLURM_JOB_NUM_NODES}"
echo "GPUs per node: ${SLURM_GPUS_PER_NODE}"
echo "Tasks per node: ${SLURM_NTASKS_PER_NODE}"
echo "CPUs per task: ${SLURM_CPUS_PER_TASK}"
echo "Working directory: $(realpath ${PWD})"
echo "Location of venv: $(realpath $VENV_DIR)"

echo
echo "## Installing packages"

START=$(date +%s)
pip install --upgrade pip
pip install -e ../../.[dawn]
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/xpu
pip install --trusted-host pytorch-extension.intel.com intel-extension-for-pytorch==2.7.10+xpu oneccl_bind_pt==2.7.0+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
END=$(date +%s)
ELAPSED=$((${END}-${START}))

echo
echo "## Details post"
echo
echo "Time completed: $(date --iso-8601=ns)"
echo "Epoch start: ${START}"
echo "Epoch end: ${END}"
echo "Elapsed: ${ELAPSED} seconds"

echo
echo "## Tidying up"

deactivate
popd

echo
echo "## Aurora create virtual environment script completed"
```
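The create/activate/deactivate cycle in the script can be exercised locally. A minimal sketch using `python3` in a throwaway directory (Dawn itself pins `python3.11`; this assumes the `venv` module is available):

```shell
# Create a throwaway venv, activate it, confirm the interpreter, then deactivate.
VENV_DIR="$(mktemp -d)/venv_demo"
python3 -m venv "${VENV_DIR}"
. "${VENV_DIR}/bin/activate"
python -c 'import sys; print(sys.prefix)'   # prints the venv path while active
deactivate
```

The batch script activates with `.` (a POSIX synonym for `source`) so that subsequent `pip install` commands target the venv rather than the system Python.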
