Update NVIDIA driver symlink script #158

Draft
casparvl wants to merge 1 commit into EESSI:main from casparvl:link_nvidia_drivers

casparvl (Contributor) commented Feb 4, 2026

We'll need the following variant symlinks to be in place before this script can work as intended:

ln -s '$(EESSI_202506_NVIDIA_OVERRIDE:-/cvmfs/software.eessi.io/defaults/nvidia)' /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/nvidia
ln -s '$(EESSI_202506_NVIDIA_OVERRIDE:-/cvmfs/software.eessi.io/defaults/nvidia)' /cvmfs/software.eessi.io/versions/2025.06/compat/linux/aarch64/lib/nvidia
ln -s '$(EESSI_202506_NVIDIA_OVERRIDE:-/cvmfs/software.eessi.io/defaults/nvidia)' /cvmfs/software.eessi.io/versions/2025.06/compat/linux/riscv64/lib/nvidia

And then:

ln -s '$(EESSI_NVIDIA_OVERRIDE_DEFAULT:-/dev/null)' /cvmfs/software.eessi.io/defaults/nvidia
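To make the intended precedence explicit, here is a minimal bash sketch that emulates how the two variant symlinks chain together. It is only an illustration, not how CVMFS resolves variant symlinks internally. The function name `effective_nvidia_target` is hypothetical, and the sketch assumes an empty value is treated the same as an unset one.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): emulate the two-level fallback.
#   $1: value of EESSI_202506_NVIDIA_OVERRIDE (may be empty)
#   $2: value of EESSI_NVIDIA_OVERRIDE_DEFAULT (may be empty)
# The version-specific override wins; otherwise the default override is used;
# if neither is set, the chain ends at /dev/null.
effective_nvidia_target() {
    local version_override="$1"
    local default_override="$2"
    echo "${version_override:-${default_override:-/dev/null}}"
}

# Example calls: nothing configured, then only the default override configured
effective_nvidia_target "" ""
effective_nvidia_target "" "/opt/eessi/nvidia"
```

With nothing configured the chain ends at /dev/null, which is the case the linking script is expected to reject.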

This can then be quite easily tested from within the container:

./eessi_container.sh -a rw -r software.eessi.io -b $<host-software-layer-scripts>:/software-layer-scripts --nvidia all
cd /software-layer-scripts/scripts/gpu_support/nvidia
./link_nvidia_host_libraries.sh

This should error out, stating that the variant symlink resolves to /dev/null. Then you can change /etc/cvmfs/default.local to set EESSI_NVIDIA_OVERRIDE_DEFAULT (e.g. to /opt/eessi/nvidia) and run the linking script again - this should then install the symlinks.
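As a rough sketch, the corresponding line in /etc/cvmfs/default.local could look like the fragment below. The path /opt/eessi/nvidia is just the example value used here; any writeable location should do. Depending on the setup, the CVMFS client may need to reload its configuration or remount the repository before the new target is picked up (that part is an assumption, not something tested in this PR).

```shell
# Fragment for /etc/cvmfs/default.local (example value, adjust to your site)
EESSI_NVIDIA_OVERRIDE_DEFAULT=/opt/eessi/nvidia
```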

casparvl (Contributor, Author) commented Feb 4, 2026

Although we don't have the symlinks yet, I can actually already test this in the container - it will just create the symlinks in /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/nvidia/ in the writeable overlay. That's fine.

What I did:

$ cd /software-layer-scripts/scripts/gpu_support/nvidia/
$ umask 0022
$ source /cvmfs/software.eessi.io/versions/2025.06/init/lmod/bash
# For some reason this failed to load the module - some module cache issue?
$ module load EESSI/2025.06
$ cat > dummy.c <<'EOF'
int main(void) { return 0; }
EOF
$ gcc -Wall -Wl,--no-as-needed -lcuda dummy.c -o dummy -L /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/nvidia/
# singularity has /.singularity.d/libs with the CUDA drivers in the LD_LIBRARY_PATH, but those are not the ones we want to find...
$ unset LD_LIBRARY_PATH
$ ldd dummy
        linux-vdso.so.1 (0x00007ffc59bb4000)
        libcuda.so.1 => /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/nvidia/libcuda.so.1 (0x000014f19b377000)
...

Works as intended. After implementing the variant symlinks, we should retest, try to use the EESSI_NVIDIA_OVERRIDE_DEFAULT symlink, and, once that works, try again using the EESSI_202506_NVIDIA_OVERRIDE variant symlink.

bedroge (Contributor) commented Feb 17, 2026

Tested in the container using EESSI 2025.06 and without having configured the variant symlinks:

ERROR: /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/nvidia is a symlink pointing to /cvmfs/software.eessi.io/defaults/nvidia, which is a symlink pointing to /dev/null
If you want to symlink the drivers in a single location for all EESSI versions, please define the EESSI_NVIDIA_OVERRIDE_DEFAULT variant symlink in your local CVMFS configuration to point to writeable location. This will change the target of symlink /cvmfs/software.eessi.io/defaults/nvidia.
If you want to symlink the drivers only for this version of EESSI (2025.06), please define the EESSI__NVIDIA_OVERRIDE variant symlink in your local CVMFS configuration to point to writeable location. This will change the target of symlink /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/nvidia.

With the variant symlink reconfigured as EESSI_NVIDIA_OVERRIDE_DEFAULT=/opt/eessi/nvidia:

Ensure the final target of /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/nvidia (/opt/eessi/nvidia) exists
Host NVIDIA GPU drivers linked successfully for EESSI

Wiping that dir and doing it again using EESSI_202506_NVIDIA_OVERRIDE=/opt/eessi/nvidia yields the same result.

Also checked the symlinks, and they pointed to the expected locations.
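For anyone wanting to repeat that last check, a small sketch along these lines prints each hop of the symlink chain. `print_symlink_chain` is a hypothetical helper and not part of the script in this PR; it only relies on `readlink` and `dirname`, and it caps the number of hops at 10 as an arbitrary guard against loops.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): print each hop of a symlink chain,
# one "link -> target" line per hop, stopping at the first non-symlink.
print_symlink_chain() {
    local path="$1"
    local target
    local hops=0
    while [[ -L "$path" && $hops -lt 10 ]]; do
        target=$(readlink "$path")
        # Resolve relative targets against the directory containing the link
        if [[ "$target" != /* ]]; then
            target="$(dirname "$path")/$target"
        fi
        echo "$path -> $target"
        path="$target"
        hops=$((hops + 1))
    done
}
```

Calling it as `print_symlink_chain /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/nvidia` should show whether the chain ends at the configured override directory or at /dev/null.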

msg="${msg} the EESSI_NVIDIA_OVERRIDE_DEFAULT variant symlink in your local CVMFS configuration to point to"
msg="${msg} writeable location. This will change the target of symlink ${target1}.\n"
msg="${msg}If you want to symlink the drivers only for this version of EESSI (${EESSI_VERSION}), please define"
msg="${msg} the EESSI_${ESSSI_VERSION//./}_NVIDIA_OVERRIDE variant symlink in your local CVMFS configuration to point to"

Suggested change
- msg="${msg} the EESSI_${ESSSI_VERSION//./}_NVIDIA_OVERRIDE variant symlink in your local CVMFS configuration to point to"
+ msg="${msg} the EESSI_${EESSI_VERSION//./}_NVIDIA_OVERRIDE variant symlink in your local CVMFS configuration to point to"
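The effect of the misspelled variable name can be reproduced in isolation. The sketch below assumes `ESSSI_VERSION` (three S's) is unset, as it would be in the script, and uses `2025.06` as the example version; the variable names `broken` and `fixed` are only for illustration. It matches the `EESSI__NVIDIA_OVERRIDE` seen in the error message from the test run.

```shell
#!/usr/bin/env bash
EESSI_VERSION="2025.06"
unset ESSSI_VERSION   # the misspelled name is never set anywhere

# Misspelled name: the expansion is empty, leaving a double underscore
broken="EESSI_${ESSSI_VERSION//./}_NVIDIA_OVERRIDE"
# Correct name: the dot is stripped from 2025.06
fixed="EESSI_${EESSI_VERSION//./}_NVIDIA_OVERRIDE"

echo "$broken"   # prints EESSI__NVIDIA_OVERRIDE
echo "$fixed"    # prints EESSI_202506_NVIDIA_OVERRIDE
```

Running a script with `set -u` would turn this kind of misspelling into an "unbound variable" error instead of a silently empty string.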

fatal_error "${msg}"
fi
else
msg="$target1 does not seem to be a CVMFS variant symlink, suggesting that EESSI_${ESSSI_VERSION//./}_NVIDIA_OVERRIDE"

Suggested change
- msg="$target1 does not seem to be a CVMFS variant symlink, suggesting that EESSI_${ESSSI_VERSION//./}_NVIDIA_OVERRIDE"
+ msg="$target1 does not seem to be a CVMFS variant symlink, suggesting that EESSI_${EESSI_VERSION//./}_NVIDIA_OVERRIDE"

nvidia_trusted_dir="${EESSI_EPREFIX}/lib/nvidia"
if [[ -L "$nvidia_trusted_dir" ]]; then
target1=$(readlink "$nvidia_trusted_dir")
log_verbose "$nvidia_trusted_dir is a CVMFS variant symlink (EESSI_${ESSSI_VERSION//./}_NVIDIA_OVERRIDE) currently pointing to $target1"

Suggested change
- log_verbose "$nvidia_trusted_dir is a CVMFS variant symlink (EESSI_${ESSSI_VERSION//./}_NVIDIA_OVERRIDE) currently pointing to $target1"
+ log_verbose "$nvidia_trusted_dir is a CVMFS variant symlink (EESSI_${EESSI_VERSION//./}_NVIDIA_OVERRIDE) currently pointing to $target1"

log_verbose "${msg}"

# Check if target2 isn't /dev/null (the default target of the EESSI_NVIDIA_OVERRIDE_DEFAULT variant symlink)
# If it is, suggest setting EESSI_NVIDIA_OVERRIDE_DEFAULT or EESSI_${ESSSI_VERSION//./}_NVIDIA_OVERRIDE

Suggested change
- # If it is, suggest setting EESSI_NVIDIA_OVERRIDE_DEFAULT or EESSI_${ESSSI_VERSION//./}_NVIDIA_OVERRIDE
+ # If it is, suggest setting EESSI_NVIDIA_OVERRIDE_DEFAULT or EESSI_${EESSI_VERSION//./}_NVIDIA_OVERRIDE

# Do some checks on existence of links and that we don't end up at /dev/null (the default), so we can print some informative information
# One downside is that we can't explicitely check if something is a variant symlink, so we'll just assume that if it's a link AND it
# lives in our CVMFS repository, it must be a variant symlink
nvidia_trusted_dir="${EESSI_EPREFIX}/lib/nvidia"

Does this mean that the script will no longer work for 2023.06?
