
Commit 997d6cc

llewelld authored and Iain-S committed
Update Baskerville sbatch and srun scripts
Updates the scripts for queuing using `sbatch` or running within an interactive shell on a compute node using `srun`. The objective of these changes is to ensure:

1. Everything works simply by running the scripts.
2. There is an element of consistency across the various scripts.

The minimum xarray version on Baskerville has been increased to 2023.06.0 in order to avoid issue #7880, which I experienced consistently when executing runmodel.py. See the relevant xarray issue and changelog:

pydata/xarray#7880
https://github.com/pydata/xarray/releases/tag/v2023.06.0
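The xarray pin mentioned above can be expressed as a lower bound in the project's dependency metadata rather than in the scripts themselves. A hypothetical sketch of what the relevant `pyproject.toml` fragment might look like — the extra name `bask` is taken from the `pip install -e ../../.[bask]` lines in the scripts below, but the exact dependency list is an assumption:

```toml
# Hypothetical fragment: a "bask" extra pinning xarray to the minimum
# version known to avoid pydata/xarray#7880 on Baskerville.
[project.optional-dependencies]
bask = [
    "microsoft-aurora",
    "cdsapi",
    "xarray>=2023.06.0",
]
```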
1 parent c04e908 · commit 997d6cc

30 files changed: +220 −178 lines

baskerville/dawn-comparison/README.md

Lines changed: 13 additions & 12 deletions
@@ -31,41 +31,42 @@ preds_{i}-bask.pkl
 ## Generating results and graphs
 
 To run the script to generate the results on Baskerville and output the graphs, use the following:
-```
+```bash
 sbatch batch-comparison.sh
 ```
 
 ## Manual graph generation
 
 While working on the graphs it can be convenient to run the graph generation script manually.
 This can be done in an `srun` shell:
-```
+```bash
 srun --qos turing --account usjs9456-ati-test --time 10:00:00 --nodes 1 \
     --gpus 1 --cpus-per-gpu 36 --mem 65536 --pty /bin/bash
 ```
 
 Then source the following file to set up the environment:
-```
+```bash
 . ./batch-srun.sh
 ```
 
-Finally run the graph generation script. Any errors will cause the `srun` session to abort, so we block error return values when running this for convenience during development.
-```
-python compare-results.py || true
+Finally run the graph generation script.
+The value 4 passed in as the `-n` parameter is the number of `preds` files to use.
+In general this should be left as four to match the files generated as explained above.
+```bash
+python compare-results.py -d "../../downloads" -i "pdf" -n 4
 ```
 
 ## Output graphs
 
-Graphs will be output in both PNG and PDF format, as the following files:
+Graphs will be output in the format specified on the command line for the `-i` parameter.
+If you followed the above steps these will be in PDF format (PNG and SVG are also supported).
 ```
 plot-errors.pdf
-plot-errors.png
+plot-error-comparison.pdf
 plot-losses.pdf
-plot-losses.png
 plot-pvg-bask.pdf
-plot-pvg-bask.png
 plot-pvg-dawn.pdf
-plot-pvg-dawn.png
+plot-std-dev-comparison.pdf
 plot-var-losses.pdf
-plot-var-losses.png
+plot-weatherbench-comparison.pdf
 ```
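The environment file above is sourced (`. ./batch-srun.sh`) rather than executed, so the variables it exports and the virtual environment it activates persist in the current shell. A minimal sketch of the difference, using a stand-in file rather than the real `batch-srun.sh`:

```shell
# Create a stand-in environment file (the real batch-srun.sh loads
# modules and activates the venv; this demo just exports a variable).
cat > /tmp/demo-env.sh <<'EOF'
export DEMO_VAR="set by sourced script"
EOF

sh /tmp/demo-env.sh   # runs in a child shell: DEMO_VAR does NOT survive
. /tmp/demo-env.sh    # runs in the current shell: DEMO_VAR persists
echo "$DEMO_VAR"      # prints: set by sourced script
```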

baskerville/dawn-comparison/batch-comparison.sh

Lines changed: 2 additions & 2 deletions
@@ -7,7 +7,7 @@
 #SBATCH --gpus 1
 #SBATCH --cpus-per-gpu 36
 #SBATCH --mem 0
-#SBATCH --job-name auroria-comparison
+#SBATCH --job-name aurora-comparison
 #SBATCH --output log-comparison.txt
 
 # Execute using:
@@ -19,7 +19,7 @@ echo "## Aurora comparison script starting"
 # Quit on error
 set -e
 
-if [ ! -d ../era5-experiments/downloads ]; then
+if [ ! -d ../../downloads ]; then
     echo "Please run the batch-download.sh script to download the data."
     exit 1
 fi
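The directory check changed above follows a pattern shared by all the batch scripts: bail out early with a helpful message if the shared downloads directory is missing. A sketch of the same guard (the function name is illustrative; the scripts inline this check):

```shell
# Guard sketch: succeed only if the given downloads directory exists,
# otherwise print the hint the batch scripts use and fail.
require_downloads() {
    if [ ! -d "$1" ]; then
        echo "Please run the batch-download.sh script to download the data."
        return 1
    fi
}

# A directory that exists passes the check.
mkdir -p /tmp/demo-downloads
require_downloads /tmp/demo-downloads && echo "downloads present"
```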

baskerville/dawn-comparison/batch-inference-timing.sh

Lines changed: 7 additions & 7 deletions
@@ -19,7 +19,7 @@ echo "## Aurora inference timing script starting"
 # Quit on error
 set -e
 
-if [ ! -d ../era5-experiments/downloads ]; then
+if [ ! -d ../../downloads ]; then
     echo "Please run the batch-download.sh script to download the data."
     exit 1
 fi
@@ -30,18 +30,15 @@ echo "## Loading modules"
 module -q purge
 module -q load baskerville
 module -q load bask-apps/live
-module -q load matplotlib/3.7.2-gfbf-2023a
 module -q load PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1
 
 echo
 echo "## Initialising virtual environment"
 
-python -m venv venv
+python3.11 -m venv venv
 . ./venv/bin/activate
 
 pip install --quiet --upgrade pip
-pip install --quiet cdsapi
-pip install --quiet microsoft-aurora
 pip install --quiet -e ../../.[bask]
 
 echo
@@ -53,11 +50,14 @@ vmstat -t 1 -y > log-comparison-cpu.txt &
 
 # Perform the prediction
 # do this 4 times, once per GPU
+unset WAITING
 for i in {0..3}; do
-    CUDA_VISIBLE_DEVICES=$i python inference-timing.py -n 28 --save -o preds_$i.pkl > inference_28_steps_$i.txt &
+    CUDA_VISIBLE_DEVICES=$i python inference-timing.py -n 28 -d ../../downloads --save -o preds_$i.pkl > inference_28_steps_$i.txt &
+    WAITING+=( $! );
 done
 
-wait
+# Wait only for the processes started in the for loop
+wait "${WAITING[@]}"
 
 echo
 echo "## Tidying up"
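The `WAITING` array change above makes `wait` target only the inference processes, so the backgrounded `vmstat` logger no longer blocks the script. The pattern in isolation (bash, with `sleep` standing in for the per-GPU python runs):

```shell
# Stand-in background logger (the real script runs vmstat); we must
# NOT wait for this one.
sleep 30 &
LOGGER=$!

# Launch one worker per "GPU", recording each PID via $! as the script does.
unset WAITING
for i in 0 1 2 3; do
    sleep 0.2 &          # stand-in for: CUDA_VISIBLE_DEVICES=$i python ... &
    WAITING+=( "$!" )
done

# Wait only for the recorded worker PIDs; this returns as soon as the
# workers finish, without blocking on the long-running logger.
wait "${WAITING[@]}"
echo "all ${#WAITING[@]} workers finished"

kill "$LOGGER"           # tidy up the demo logger
```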

baskerville/era5-experiments/README.md

Lines changed: 35 additions & 72 deletions
@@ -5,35 +5,56 @@ https://microsoft.github.io/aurora/example_era5.html
 ## Set up
 
 Clone the repository:
-```
+```bash
 git clone --recursive https://github.com/alan-turing-institute/aurora-hpc.git
 cd aurora-hpc/baskerville/era5-prediction
 ```
 
 Get your API key from the Climate Data Store (see the page linked above).
 Store it in the `cdsapi.config` file by running the following, replacing APIKEY with your actual API key.
 
-```
+```bash
 printf "%s%s\n" "$(cat cdsapi.config.example)" "APIKEY" > cdsapi.config
 ```
 
-## Download the data
+## Interactive session
 
+The instructions in the following sections explain how to run the experiments as queued tasks using `sbatch`.
+However, many of these can also be run within an interactive `srun` session, which can be convenient during development.
+Setting up a session for use with the scripts can be done as follows.
+
+```bash
+srun --qos turing --account usjs9456-ati-test --time 1:00:00 \
+    --nodes 1 --gpus 1 --cpus-per-gpu 36 --mem 0 --pty /bin/bash
+. ./batch-srun.sh
 ```
+
+This will set up modules, environment and virtual environment.
+You can then run scripts directly, for example:
+
+```bash
+python download.py
+```
+
+## Download the data
+
+This will download the data to the `aurora-hpc/downloads` directory.
+
+```bash
 sbatch batch-download.sh
 ```
 
 ## Perform the prediction
 
-```
+```bash
 sbatch batch-runmodel.sh
 ```
 
 ## Display the resulting image
 
 Assuming you have X-forwarding enabled on your Baskerville session you can display the resulting image on your local machine by running the following.
 
-```
+```bash
 module load ImageMagick/7.1.0-37-GCCcore-11.3.0
 magick display plots.pdf
 ```
@@ -43,79 +64,21 @@ magick display plots.pdf
 For fine-tuning the same data download can be used.
 You can then immediately perform fine-tuning with the small (debug) model on a 40 GiB A100 with the following.
 
-```
+```bash
 sbatch batch-finetune-small.sh
 ```
 
 ## Fine-tuning the standard model
 
-Currently fine-tuning the standard model fails on an 80 GiB A100 GPU due to out-of-memory errors.
-You can try this yourself with the following:
+There are four versions of the fine-tuning process for the standard model: DDP, FSDP, Aligned and a preliminary version.
+The last of these is kept for historical interest and shows the development of the process, but won't run on a Baskerville A100 with 80 GiB of memory due to out-of-memory errors.
+This preliminary version uses a simplified loss function rather than the loss function specified in the paper, which is likely to be the source of these errors.
 
-```
-sbatch batch-finetune.sh
-```
+To test out the different versions the following commands can be used:
 
-Alternatively to run the same fine-tuning code that works on DAWN, run the following:
-
-```
+```bash
+sbatch batch-finetune-ddp.sh
+sbatch batch-finetune-fsdp.sh
 sbatch batch-finetune-aligned.sh
-```
-
-The resulting errors looks like this:
-
-```log
-/bask/apps/live/EL8-ice/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
-  return self.fget.__get__(instance, owner)()
-loading model...
-loading data...
-batching...
-preparing model...
-performing forward pass...
-calculating loss...
-performing backward pass...
-Traceback (most recent call last):
-  File "/bask/projects/u/usjs9456-ati-test/ovau2564/aurora/aurora-hpc/baskerville/era5-prediction/finetune-fsdp.py", line 88, in <module>
-    loss.backward()
-  File "/bask/apps/live/EL8-ice/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/_tensor.py", line 492, in backward
-    torch.autograd.backward(
-  File "/bask/apps/live/EL8-ice/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/autograd/__init__.py", line 251, in backward
-    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
-  File "/bask/apps/live/EL8-ice/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/autograd/function.py", line 288, in apply
-    return user_fn(self, *args)
-           ^^^^^^^^^^^^^^^^^^^^
-  File "/bask/apps/live/EL8-ice/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 271, in backward
-    outputs = ctx.run_function(*detached_inputs)
-              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/bask/apps/live/EL8-ice/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py", line 153, in my_function
-    return self._checkpoint_wrapped_module(
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/bask/apps/live/EL8-ice/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
-    return self._call_impl(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/bask/apps/live/EL8-ice/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
-    return forward_call(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/bask/projects/u/usjs9456-ati-test/ovau2564/aurora/aurora-hpc/aurora/aurora/model/swin3d.py", line 722, in forward
-    x = blk(x, c, res, rollout_step)
-        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/bask/apps/live/EL8-ice/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
-    return self._call_impl(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/bask/apps/live/EL8-ice/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
-    return forward_call(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/bask/projects/u/usjs9456-ati-test/ovau2564/aurora/aurora-hpc/aurora/aurora/model/swin3d.py", line 486, in forward
-    attn_windows = self.attn(x_windows, mask=attn_mask, rollout_step=rollout_step)
-                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/bask/apps/live/EL8-ice/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
-    return self._call_impl(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/bask/apps/live/EL8-ice/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
-    return forward_call(*args, **kwargs)
-           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-  File "/bask/projects/u/usjs9456-ati-test/ovau2564/aurora/aurora-hpc/aurora/aurora/model/swin3d.py", line 161, in forward
-    x = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=attn_dropout)
-        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 570.00 MiB. GPU 0 has a total capacty of 79.25 GiB of which 103.50 MiB is free. Including non-PyTorch memory, this process has 79.14 GiB memory in use. Of the allocated memory 76.38 GiB is allocated by PyTorch, and 2.24 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
+sbatch batch-finetune.sh
 ```

baskerville/era5-experiments/batch-download.sh

Lines changed: 4 additions & 4 deletions
@@ -6,8 +6,8 @@
 #SBATCH --nodes 1
 #SBATCH --gpus 1
 #SBATCH --cpus-per-gpu 36
-#SBATCH --job-name auroria-prepare
-#SBATCH --output log-prepare.txt
+#SBATCH --job-name aurora-prepare
+#SBATCH --output log-download.txt
 
 # Execute using:
 # sbatch ./batch-prepare.sh
@@ -37,12 +37,12 @@ module -q load PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1
 echo
 echo "## Initialising virtual environment"
 
-python -m venv venv
+python3.11 -m venv venv
 . ./venv/bin/activate
 
 pip install --quiet --upgrade pip
 pip install --quiet cdsapi
-pip install --quiet -e ../../aurora
+pip install --quiet -e ../../.[bask]
 
 echo
 echo "## Downloading data"

baskerville/era5-experiments/batch-finetune-aligned.sh

Lines changed: 5 additions & 6 deletions
@@ -7,8 +7,8 @@
 #SBATCH --gpus 1
 #SBATCH --cpus-per-gpu 36
 #SBATCH --constraint=a100_80
-#SBATCH --job-name auroria-finetune
-#SBATCH --output log-finetune.txt
+#SBATCH --job-name aurora-finetune-aligned
+#SBATCH --output log-finetune-aligned.txt
 
 # Execute using:
 # sbatch ./batch-finetune.sh
@@ -19,7 +19,7 @@ echo "## Aurora fine-tuning script starting"
 # Quit on error
 set -e
 
-if [ ! -d downloads ]; then
+if [ ! -d ../../downloads ]; then
     echo "Please run the batch-download.sh script to download the data."
     exit 1
 fi
@@ -36,12 +36,11 @@ module -q load PyTorch-bundle/2.1.2-foss-2023a-CUDA-12.1.1
 echo
 echo "## Initialising virtual environment"
 
-python -m venv venv
+python3.11 -m venv venv
 . ./venv/bin/activate
 
 pip install --quiet --upgrade pip
-pip install --quiet cdsapi
-pip install --quiet -e ../../aurora
+pip install --quiet -e ../../.[bask]
 
 echo
 echo "## Running model"

baskerville/era5-experiments/batch-finetune-ddp.sh

Lines changed: 5 additions & 6 deletions
@@ -8,8 +8,8 @@
 #SBATCH --cpus-per-gpu 36
 #SBATCH --mem 32768
 #SBATCH --constraint=a100_80
-#SBATCH --job-name auroria-finetune
-#SBATCH --output log-finetune.txt
+#SBATCH --job-name aurora-finetune-ddp
+#SBATCH --output log-finetune-ddp.txt
 
 # Execute using:
 # sbatch ./batch-finetune.sh
@@ -20,7 +20,7 @@ echo "## Aurora fine-tuning script starting"
 # Quit on error
 set -e
 
-if [ ! -d downloads ]; then
+if [ ! -d ../../downloads ]; then
     echo "Please run the batch-download.sh script to download the data."
     exit 1
 fi
@@ -42,12 +42,11 @@ export OMP_NUM_THREADS=1
 echo
 echo "## Initialising virtual environment"
 
-python -m venv venv
+python3.11 -m venv venv
 . ./venv/bin/activate
 
 pip install --quiet --upgrade pip
-pip install --quiet cdsapi
-pip install --quiet -e ../../aurora
+pip install --quiet -e ../../.[bask]
 
 echo
 echo "## Running model"
