A multi-GPU test for Oceananigans #4791
I could be wrong, but I don't think you need to manually set the visible devices; with two processes you will use devices 0 and 1 by default anyway? Since the segmentation fault occurs at `bᵢ(x, y, z) = ...`, I am not sure what the problem is. But one way to test this is to use
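For example, a minimal sketch that just prints the device each MPI rank ends up on (this assumes a recent Oceananigans release where `Distributed(GPU())` assigns one device per local rank; the launch command and script name below are only placeholders):

```julia
# Sketch: report which CUDA device each MPI rank is using.
# Launch with two ranks, e.g. `mpiexec -n 2 julia --project check_devices.jl`.
using MPI
using CUDA
using Oceananigans

if !MPI.Initialized()
    MPI.Init()
end

arch = Distributed(GPU())   # should pick one device per local rank by default
rank = MPI.Comm_rank(MPI.COMM_WORLD)

println("rank $rank is using $(CUDA.device()) out of $(length(CUDA.devices())) visible device(s)")
```

If both ranks report the same device, then the default assignment is not doing what you expect and you are back to setting the visible devices manually.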
Hi, I have got a similar problem. How did you fix it? Thanks!

```julia
using CUDA
using Oceananigans
using Oceananigans.Units: minute, minutes, hours
using Oceananigans.Units: GiB, MiB, KiB
using Statistics
using Oceananigans.TurbulenceClosures: viscous_flux_uz
using CairoMakie
using MPI
if !MPI.Initialized()
MPI.Init()
end
child_architecture = GPU()
architecture = Distributed(child_architecture)
const H = 15 # m
grid = RectilinearGrid(architecture, size=(128,128,128), extent=(π*H, π*H/2, H),halo = (4,4,4))
const u★=0.01 #friction velocity
const g_Earth = 9.81 # m/s²
Fx(x,y,z,t)=u★^2/H #forcing
z₀ = 0.01 # m (roughness length)
κ = 0.4 # von Karman constant
z₁ = grid.Lz + znodes(grid, Center())[1] # height of the first grid center above the bottom
cᴰᵇ = (κ / log(z₁ / z₀))^2 # Drag coefficient
@inline drag_u(x, y, t, u, v, p) = - p.cᴰᵇ * √(u^2 + v^2) * (u)
@inline drag_v(x, y, t, u, v, p) = - p.cᴰᵇ * √(u^2 + v^2) * (v)
drag_bc_u = FluxBoundaryCondition(drag_u, field_dependencies=(:u, :v), parameters=(; cᴰᵇ))
drag_bc_v = FluxBoundaryCondition(drag_v, field_dependencies=(:u, :v), parameters=(; cᴰᵇ))
u_bcs = FieldBoundaryConditions(bottom = drag_bc_u)
v_bcs = FieldBoundaryConditions(bottom = drag_bc_v)
T_bcs = FieldBoundaryConditions(top=FluxBoundaryCondition(0.0),
bottom=FluxBoundaryCondition(0.0))
S_bcs = FieldBoundaryConditions(top=FluxBoundaryCondition(0.0),bottom=FluxBoundaryCondition(0.0))
model = NonhydrostaticModel(; grid,
advection = WENO(order=9),
timestepper = :RungeKutta3,
tracers =(:T,:S),
buoyancy = SeawaterBuoyancy(),
closure = AnisotropicMinimumDissipation(),
boundary_conditions = (u=u_bcs,v=v_bcs,T=T_bcs,S=S_bcs),
forcing=(u=Fx,))
Ξ(z) = randn()*exp(z/4)
uᵢ(x, y, z) = u★ * 1e-2 * Ξ(z) + 0.21
wᵢ(x, y, z) = u★ * 1e-2 * Ξ(z)
vᵢ(x, y, z) = u★ * 1e-2 * Ξ(z)
Tᵢ=290 #K
Sᵢ=35 #PSU
set!(model, u=uᵢ, v=vᵢ, w=wᵢ, T=Tᵢ, S=Sᵢ) # apply the initial conditions defined above
simulation = Simulation(model, Δt=0.4, stop_time=24hours)
wizard = TimeStepWizard(cfl=0.8, max_change=1.1, max_Δt=0.1minute)
simulation.callbacks[:wizard] = Callback(wizard, IterationInterval(10))
using Printf
function progress(simulation)
u, v, w = simulation.model.velocities
## Print a progress message
msg = @sprintf("i: %04d, t: %s, Δt: %s, umax = (%.5e, %.5e, %.5e) ms⁻¹, wall time: %s\n",
iteration(simulation),
prettytime(time(simulation)),
prettytime(simulation.Δt),
maximum(abs, u), maximum(abs, v), maximum(abs, w),
prettytime(simulation.run_wall_time))
@info msg
return nothing
end
simulation.callbacks[:progress] = Callback(progress, IterationInterval(50))
run!(simulation)
```
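In case it helps: a distributed script like this is usually launched with one MPI rank per GPU, e.g. something like `mpiexec -n 2 julia --project your_script.jl` on a node with two GPUs (the exact launcher, `mpiexec`, `mpirun`, or `srun`, depends on your system).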
It might be that your CUDA-aware MPI is not enabled. I see that you are using Open MPI. You can try checking this from Julia: if the output is false, then CUDA-aware MPI is not active. You can find more information here: https://juliaparallel.org/MPI.jl/stable/knownissues/#CUDA-aware-MPI
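For reference, MPI.jl provides `MPI.has_cuda()`, which reports whether the underlying MPI library was built with CUDA support; a minimal check might look like this:

```julia
# Minimal CUDA-aware-MPI check; run it the same way you launch the simulation
# (e.g. under `mpiexec`) so it queries the same MPI library.
using MPI

if !MPI.Initialized()
    MPI.Init()
end

# true  -> the MPI library was built with CUDA support
# false -> CUDA-aware MPI is not available and GPU halo exchanges may crash
@show MPI.has_cuda()
```

If it comes back false, the usual fix is to point MPI.jl at a CUDA-aware system MPI with `MPIPreferences.use_system_binary()`, as described in the MPI.jl documentation.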