
Conversation

@ax3l (Member) commented Nov 13, 2025

Finalize a proper, pre-built WarpX container for Perlmutter GPUs with GPU-aware/GPUdirect MPI.

To Do

  • Builds
  • Finalize Entrypoint

Follow-Up

  • Finalize MPI tuning based on INC0245154 response/guidance
  • Ensure it runs on Perlmutter
  • Ensure GPU-aware MPI/GPUdirect works (see the runtime sketch after this list)
    • Ensure Slingshot is used optimally, i.e., that Cray MPICH is actually picked up
  • Build all WarpX dims
  • Docs
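
For the GPU-aware MPI item above, the runtime side on Perlmutter presumably comes down to Cray MPICH's MPICH_GPU_SUPPORT_ENABLED switch; a minimal sketch follows (node/task counts and the executable name are placeholders, not taken from this PR):

# Sketch only: turn on GPUdirect transfers in Cray MPICH, then launch one rank per GPU.
export MPICH_GPU_SUPPORT_ENABLED=1
srun -N 1 --ntasks-per-node=4 --gpus-per-node=4 warpx.3d inputs_3d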

@ax3l added labels backend: cuda, component: documentation, and machine / system on Nov 13, 2025
@ax3l force-pushed the doc-proper-pm-container branch 6 times, most recently from 5f90655 to b7f0d77 on November 14, 2025 22:55
@ax3l force-pushed the doc-proper-pm-container branch from b7f0d77 to c7f3cd5 on November 14, 2025 22:55
./configure \
--disable-fortran \
--prefix=/opt/warpx \
--with-ch4-shmmods=posix,gpudirect \
@ax3l (Member, Author) commented

From our NERSC Ticket:

Rahulkumar Gayatri (rgayatri)

Hey Axel, Adam:
Just FYI - If the plan is to build mpich inside the container and then replace it at runtime with cray-mpich, make sure that the mpich inside the container is built WITHOUT cuda, since that interferes with cuda-aware-mpi of cray-mpich at runtime for some reason.

Regards,
Rahul.
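
Following that advice, the in-container MPICH configure would presumably look like the sketch below: the same flags as in the hunk above, but with CUDA left out. The --without-cuda flag is an assumption about the MPICH configure interface; the essential point from NERSC is simply not to enable CUDA in this build.

# Sketch only, not the PR's final configure line.
./configure \
  --disable-fortran \
  --prefix=/opt/warpx \
  --without-cuda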

@ax3l (Member, Author) commented Nov 18, 2025

As written above, slightly counterintuitively, if we do not compile with gpudirect, it is possible for NERSC to automatically swap out our MPI libs as we start up the container (they copy in the Cray MPI libs and resquash the image on startup):

Suggested change (remove this line):
--with-ch4-shmmods=posix,gpudirect \

# WarpX Python bindings are installed in /opt/venv
#
# On Perlmutter, run WarpX like this:
# podman run --rm --gpu --mpi --nccl -it warpx-perlmutter warpx.2d inputs_2d
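
For reference, a multi-node launch of this image would presumably be wrapped in srun, roughly as sketched below; node and task counts are placeholders, and --gpu/--mpi are the podman-hpc plugin flags from the NERSC docs linked further down.

# Sketch only: one container per rank, MPI wired in by the podman-hpc plugins.
srun -N 2 --ntasks-per-node=4 --gpus-per-node=4 \
  podman-hpc run --rm --gpu --mpi warpx-perlmutter \
  warpx.2d inputs_2d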
@ax3l (Member, Author) commented Nov 18, 2025

Urgh, the podman-hpc --mpi plugin is broken on Perlmutter; it fails at startup.

The moment I put --mpi or --cuda-mpi in, it starts to wildly connect to external registries and fails.
[Screenshot from 2025-11-18 10-28-51: failing container startup]
https://docs.nersc.gov/development/containers/podman-hpc/overview/#using-podman-hpc-as-a-container-runtime

@ax3l (Member, Author) commented Nov 21, 2025

Everyone is at SC25 this week, so no progress on my NERSC bug report in INC0245154 yet.

I posted a reproducer therein: https://github.com/ax3l/warpx/tree/doc-proper-pm-container-mpihellogpu/Tools/machines/perlmutter-nersc

Prevents swapping for Cray Libs with MPI Plugin,
per NERSC support
ARG mpich_prefix=mpich-$mpich

RUN \
curl -Lo $mpich_prefix.tar.gz https://www.mpich.org/static/downloads/$mpich/$mpich_prefix.tar.gz && \
@ax3l (Member, Author) commented

If we want to run with GPU-unaware MPI to work around the Podman-HPC MPI plugin issue, we would need to patch in pmodels/mpich#5720 to avoid an assert at AMReX startup, which uses the function added in that PR:
https://github.com/AMReX-Codes/amrex/blob/25.11/Src/Base/AMReX_ParallelDescriptor.cpp#L1547-L1549
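
A hypothetical way to fold that patch into the container build is sketched below; the GitHub .patch URL, the -p level, and the configure flags are assumptions, not what this PR ships.

# Sketch only: apply pmodels/mpich#5720 to the unpacked MPICH sources before building.
RUN tar xf $mpich_prefix.tar.gz && \
    cd $mpich_prefix && \
    curl -L https://github.com/pmodels/mpich/pull/5720.patch | patch -p1 && \
    ./configure --disable-fortran --prefix=/opt/warpx --without-cuda && \
    make -j $(nproc) install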


@ax3l changed the title from "[WIP] Perlmutter: GPU Docker Container" to "Perlmutter: GPU Docker Container" on Nov 25, 2025
@ax3l ax3l marked this pull request as ready for review November 25, 2025 19:28
@ax3l ax3l requested review from EZoni and RemiLehe November 25, 2025 19:28
@ax3l (Member, Author) commented Nov 25, 2025

@RemiLehe I would merge this update, as it is a good basis for sharing with other power-developers.
I intentionally add no docs outside of inline comments.

Follow-ups will (A) generalize this once the NERSC MPI plugin issue is fixed and (B) make a dead-end PR that patches this to build a one-off no-MPI variant for our LDRD.

@ax3l ax3l merged commit 57e114e into BLAST-WarpX:development Nov 25, 2025
82 checks passed
@ax3l ax3l deleted the doc-proper-pm-container branch November 25, 2025 21:47
