Perlmutter: GPU Docker Container #6389
Conversation
Force-pushed from 5f90655 to b7f0d77
Finalize a proper, pre-built WarpX for Perlmutter GPUs with GPU-aware/GPUdirect MPI.
Force-pushed from b7f0d77 to c7f3cd5
| ./configure \
| --disable-fortran \
| --prefix=/opt/warpx \
| --with-ch4-shmmods=posix,gpudirect \
From our NERSC Ticket:
Rahulkumar Gayatri (rgayatri)
Hey Axel, Adam:
Just FYI - If the plan is to build mpich inside the container and then replace it at runtime with cray-mpich, make sure that the mpich inside the container is built WITHOUT cuda, since that interferes with cuda-aware-mpi of cray-mpich at runtime for some reason.
Regards,
Rahul.
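A minimal sketch of what a CUDA-free MPICH build inside the container could look like, following this advice; the `--without-cuda` switch and the posix-only shared-memory module are illustrative assumptions, not taken from the Dockerfile in this PR:

```sh
# Sketch only (not the PR's Dockerfile): configure MPICH without any CUDA
# support, so the Cray MPICH libraries injected at runtime provide the
# CUDA-aware path instead of the container-built MPI.
./configure \
    --disable-fortran \
    --prefix=/opt/warpx \
    --with-ch4-shmmods=posix \
    --without-cuda
make -j"$(nproc)" install
```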
As written above, slightly counterintuitively, if we do not compile with gpudirect it is possible for NERSC to automatically swap out our MPI libs as we start up the container (they copy in the Cray MPI libs and resquash the image on startup):
| --with-ch4-shmmods=posix,gpudirect \
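One way to confirm whether that swap actually happened is to check which libmpi the WarpX binary resolves inside the running container; the srun/podman-hpc flags below are only a plausible sketch, not a tested command from this PR:

```sh
# Sketch: if the podman-hpc MPI plugin swapped the libraries in, ldd should
# resolve libmpi to the injected Cray MPICH rather than the MPICH that was
# built into the image under /opt/warpx.
srun -N 1 -n 1 podman-hpc run --rm --gpu --mpi warpx-perlmutter \
    sh -c 'ldd "$(command -v warpx.2d)" | grep -i mpi'
```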
| # WarpX Python bindings are installed in /opt/venv
| #
| # On Perlmutter, run WarpX like this:
| # podman run --rm --gpu --mpi --nccl -it warpx-perlmutter warpx.2d inputs_2d
Urgh, the podman-hpc --mpi plugin is broken on PM and fails at startup.
The moment I add --mpi or --cuda-mpi, it starts wildly connecting to external registries and fails.
https://docs.nersc.gov/development/containers/podman-hpc/overview/#using-podman-hpc-as-a-container-runtime
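For context, the launch pattern described in those NERSC docs looks roughly like the sketch below; node/task counts and the image tag are placeholders, and it is exactly the --mpi/--cuda-mpi path that currently fails at startup:

```sh
# Sketch of the documented podman-hpc launch on Perlmutter; placeholders
# throughout. Without --mpi the container starts, but adding --mpi or
# --cuda-mpi currently triggers the external registry lookups and the
# startup failure described above.
srun -N 2 --ntasks-per-node=4 --gpus-per-node=4 \
    podman-hpc run --rm --gpu --mpi \
    warpx-perlmutter \
    warpx.2d inputs_2d
```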
Everyone is at SC25 this week, so there is no progress yet on my NERSC bug report INC0245154.
I posted a reproducer there: https://github.com/ax3l/warpx/tree/doc-proper-pm-container-mpihellogpu/Tools/machines/perlmutter-nersc
Prevents swapping in the Cray libs with the MPI plugin, per NERSC support
| ARG mpich_prefix=mpich-$mpich
|
| RUN \
| curl -Lo $mpich_prefix.tar.gz https://www.mpich.org/static/downloads/$mpich/$mpich_prefix.tar.gz && \
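For completeness, the remainder of such a RUN step usually follows the common download/extract/build pattern; only the curl line above and the configure flags quoted earlier in this diff come from the PR, everything else here is an assumed sketch:

```sh
# Sketch of the rest of the MPICH build step; only the curl command and the
# configure flags are quoted in this PR, the rest is the usual pattern.
curl -Lo $mpich_prefix.tar.gz \
    https://www.mpich.org/static/downloads/$mpich/$mpich_prefix.tar.gz
tar xf $mpich_prefix.tar.gz
cd $mpich_prefix
./configure \
    --disable-fortran \
    --prefix=/opt/warpx \
    --with-ch4-shmmods=posix,gpudirect
make -j"$(nproc)" install
cd .. && rm -rf $mpich_prefix $mpich_prefix.tar.gz
```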
If we want to run with GPU-unaware MPI to work around the Podman-HPC MPI plugin issue, we would need to patch in pmodels/mpich#5720 to avoid an assert during AMReX startup, which uses the function that PR adds:
https://github.com/AMReX-Codes/amrex/blob/25.11/Src/Base/AMReX_ParallelDescriptor.cpp#L1547-L1549
@RemiLehe I would merge this update, as it is a good basis for sharing with other power developers. Follow-ups will A) generalize this once the NERSC MPI plugin issue is fixed and B) make a dead-end PR that patches this to build a one-off no-MPI variant for our LDRD.
Finalize a proper, pre-built WarpX for Perlmutter GPUs with GPU-aware/GPUdirect MPI.
To Do
Follow-Up