Updates the scripts for queuing using sbatch, or for running within an interactive shell on a compute node using srun.

The objective of these changes is to ensure:
1. Everything works simply by running the scripts.
2. There's consistency across the various scripts.

The minimum xarray version on Baskerville has been increased to 2023.06.0 in order to avoid Issue #7880, which I experienced consistently when executing runmodel.py. See here for the relevant xarray issue and changelog:
pydata/xarray#7880
https://github.com/pydata/xarray/releases/tag/v2023.06.0
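As a quick sanity check before running the model, the installed xarray release can be compared against this minimum. This is just an illustrative sketch (the helper names are invented; only the `2023.06.0` minimum comes from the text above):

```python
from importlib.metadata import version

# Minimum xarray release that avoids pydata/xarray#7880
MIN_XARRAY = (2023, 6, 0)

def parse_version(text):
    """Turn a release string like '2023.06.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split(".")[:3])

def xarray_is_recent_enough():
    """True when the installed xarray meets the Baskerville minimum."""
    return parse_version(version("xarray")) >= MIN_XARRAY
```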
Then source the following file to set up the environment:

```bash
. ./batch-srun.sh
```

Finally run the graph generation script.

The value 4 passed in as the `-n` parameter is the number of `preds` files to use. In general this should be left as four to match the files generated as explained above.
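For illustration, an `-n` option of this shape could be defined with argparse as below. This is a hypothetical sketch, not the actual implementation of the graph generation script:

```python
import argparse

def build_parser():
    """Hypothetical CLI mirroring the -n option described above."""
    parser = argparse.ArgumentParser(description="Generate comparison graphs")
    parser.add_argument(
        "-n",
        type=int,
        default=4,
        help="number of preds files to use (four matches the files generated above)",
    )
    return parser
```

With this parser, running the script with no arguments falls back to the default of four, matching the recommendation above.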
This will set up modules, environment and virtual environment. You can then run scripts directly, for example:

```bash
python download.py
```

## Download the data

This will download the data to the `aurora-hpc/downloads` directory.

```bash
sbatch batch-download.sh
```

## Perform the prediction

```bash
sbatch batch-runmodel.sh
```

## Display the resulting image

Assuming you have X-forwarding enabled on your Baskerville session, you can display the resulting image on your local machine by running the following.

```bash
module load ImageMagick/7.1.0-37-GCCcore-11.3.0
magick display plots.pdf
```
For fine-tuning, the same data download can be used. You can then immediately perform fine-tuning with the small (debug) model on a 40 GiB A100 with the following.

```bash
sbatch batch-finetune-small.sh
```

## Fine-tuning the standard model

There are four versions of the fine-tuning process for the standard model: DDP, FSDP, Aligned and a preliminary version. The last of these is of historical interest and shows the development of the process, but won't run on a Baskerville A100 with 80 GiB of memory due to out-of-memory errors. This preliminary version uses a simplified loss function rather than the loss function specified in the paper, which is likely to be the source of these errors.
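To make the distinction concrete, the difference between a simplified loss and a per-variable weighted loss can be sketched as follows. This is purely illustrative: the variable names and weights are invented, and neither function reproduces the repository's actual training code or the paper's exact objective:

```python
import numpy as np

def simplified_loss(pred, target):
    """Plain mean absolute error over every output value."""
    return np.mean(np.abs(pred - target))

def weighted_loss(preds, targets, weights):
    """Per-variable weighted mean absolute error.

    preds/targets: dict mapping variable name -> array
    weights: dict mapping variable name -> float (hypothetical values)
    """
    total = 0.0
    for name, weight in weights.items():
        total += weight * np.mean(np.abs(preds[name] - targets[name]))
    return total / sum(weights.values())
```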

To test out the different versions the following commands can be used:

```bash
sbatch batch-finetune-ddp.sh
sbatch batch-finetune-fsdp.sh
sbatch batch-finetune-aligned.sh
```
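The three submissions above can equally be scripted in a loop, as in this convenience sketch. The `finetune_script` helper is hypothetical; `sbatch` is the Slurm submission command already used above, and the guard simply skips submission on machines where Slurm is not installed:

```shell
# Map a variant name onto its batch script (hypothetical helper)
finetune_script() {
  printf 'batch-finetune-%s.sh' "$1"
}

# Submit each fine-tuning variant in turn
for variant in ddp fsdp aligned; do
  script="$(finetune_script "$variant")"
  # Only submit when sbatch is actually available (e.g. on Baskerville)
  if command -v sbatch >/dev/null 2>&1; then
    sbatch "$script"
  else
    echo "would submit: $script"
  fi
done
```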