Commit 87e2f91
authored
[gpu] strict driver and cuda version assignment (#1275)
* [gpu] toward a more consistent driver and CUDA install
gpu/install_gpu_driver.sh
* exclusively using .run file installation method when available
* build nccl from source
* cache build artifacts from kernel driver and nccl
* Tested more CUDA minor versions
* gathering CUDA and driver version from URLs if passed
* Printing warnings when combination provided is known to fail
* waiting on apt lock when it exists
* wrapping expensive functions in completion checks to reduce re-run time
* fixed a problem with ops agent not installing ; using venv
* Installing gcc-12 on ubuntu22 to fix kernel driver FTBFS
* setting better spark defaults
* skipping proxy setup if http-proxy metadata not set
* added function to check secure-boot and os version compatability
gpu/manual-test-runner.sh
* order commands correctly
gpu/test_gpu.py
* clearer test skipping logic
* added instructions on how to test pyspark
* correcting driver for cuda 12.4
* correcting cuda subversion. 12.4.0 instead of 12.4.1 so that driver and cuda match up
* corrected cannonical 11.8 driver version ; removed extra code and comment ; added better description of what is in the runfile
* skipping most tests ; using 11.7 from the cuda 11 line instead of the less well supported 11.8
* verified that the cuda and driver versions match up
* reducing log capture
* temporarily increasing machine shape for build caching
* 64 is too many for a single T4
* added a subversion for 11.7
* add more tests to the install function
* only including architectures supported by this version of CUDA
* pinning down versions better ; more caching ; more ram disks ; new pytorch and tensorflow test functions
* using maximum from 8.9 series on rocky for 11.7
* skip full build
* pinning to bazel-7.4.0
* NCCL requires gcc-11 for cuda11
* rocky8 is now building from the source in the .run file
* reverting to previous state of only selecting a compiler version on latest releases
* replaced literal path names with variable values ; indexing builds by the signing key used
* moved variable definition to prepare function ; moved driver signing to build phase
* test whether variable is defined before checking its value
* cache only the bins and logs
* build index of kernel modules after unpacking ; remove call to non-existent function
* only build module dependency index once
* skipping CUDA 11 NCCL build on debian12
* skip cuda11 on debian12, rocky9
* renamed verify_pyspark to verify_instance_pyspark
* failing somewhat gracefully ; skipping tests that would fail
* skipping single node tests for rocky8
* re-enable other tests
* Specifying bazel version with variable
* fixing up some skip logic
* replaced OS_NAME with _shortname
* skip more single instance tests for rocky8
* fixing indentation ; skipping redundant test
* remove retries of flakey tests
* oops ; need to define the cuda version to test for
* passing -q to gcloud to generate empty passphrase if no ssh key exists ; selecting a more modern version of the 550 driver
* including instructions on how to create a secure-boot key pair
* -e for expert, not -p for pro
* updated 11.8 and 12.0 driver versions
* added a signature check test which allows granular selection of platform to test, but does not yet verify signatures
* tuning the layout of arguments to userspace.run
* scoping DEFAULT_CUDA_VERSION correctly ; exercising rocky including kerberos on 12.6
* add a connect timeout to the ssh call instead of trying to patch around a longer than expected connection delay
* add some entropy to the process
* perhaps a re-run would have fixed 2.0-rocky8 on that last run
* increasing init action timeout to account for uncached builds
* cache non-open kernel build results
* per-kernel sub-directory for kmod tarballs
* using upstream repo and branch
* corrected grammar error
* testing Kerberos some more
* better implementation of numa node selection
* this time with a test which is exercised
* skip debian11 on Kerberos
* also skipping 2.1-ubuntu20 on kerberos clusters
* re-adjusting tests to be performed ; adjusting rather than skipping known failure cases
* more temporal variance
* skipping CUDA=12.0 for ubuntu22
* kerberos not known to succeed on 2.0-rocky8
* 2.2 dataproc images do not support CUDA <= 12.0
* skipping SINGLE configuration for rocky8 again
* not testing 2.0
* trying without test retries ; retries should happen within the test, not by re-running the test
* kerberos only works on 2.2
* using expectedFailure instead of skipTest for tests which are known to fail
* document one of the failure states
* skipping expected failures
* updated manual-test-runner.sh instructions
* this one generated from template after refactor
* do not point to local rpm pgp key
* re-ordering to reduce delta from master
* custom image usage can come later
* see #1283
* replaced incorrectly removed presubmit.sh and removed custom image key creation script intended to be removed in 70f37b6
* revert nearly to master
* can include extended test suite later
* order commands correctly
* placing all completion files in a common directory
* extend supported version list to include latest release of each minor version and their associated driver
* tested with CUDA 11.6.2/510.108.03
* nccl build completes successfully on debian10
* account for nvidia-smi ABI change post 11.6
* exercised with cuda 11.1
* cleaned up nccl build and pack code a bit
* no longer installing cudnn from local debian repo
* unpacking nccl from cache immediately rather than waiting until
later in the code
* determine cudnn version by what is available in the repo
* less noise from apt-mark hold
* nccl build tested on 11.1 and 11.6
* account for abi change in nvidia-smi
* reverting cloudbuild/Dockerfile to master
* nvidia is 404ing for download.nvidia.com ; using us.download.nvidia.com
* skipping rocky9
* * adding version 12.6 to the support matrix
* changing layout of gcs package folder
* install_pytorch function created and called when cuDNN is being installed
* incorrect version check removed
* only install pytorch if include-pytorch metadata set to true
* since call to install_pytorch is protected by metadata check, skip metadata check within the function ; create new function harden_sshd_config and call it
* increasing timeout and machine shape to reduce no-cache build time
* skip full test run due to edits to integration_tests directory
* ubuntu18 does not know about kex-gss ; use correct driver version number for cuda 11.1.1 url generation
* on rocky9 sshd service is called sshd instead of ssh as the rest of the platforms call it
* kex-gss is new in debian11
* all rocky call it sshd it seems
* cudnn no longer available on debian10
* compared with #1282 ; this change matches parity more closely
* slightly better variable declaration ordering ; it is better still in the templates/ directory from #1282
* install spark rapids
* cache the results of nvidia-smi --query-gpu
* reduce development time
* exercising more CUDA variants ; testing whether tests fail on long runs
* try to reduce concurrent builds ; extend build time further ; only enable spark rapids on images >= 2.1
* fixed bug with spark rapids version assignment ; more conservative about requirements for ramdisk ; roll back spark.SQLPlugin change
* * gpu does not work on capacity scheduler on dataproc 2.0 ; use fair
* protect against race condition on removing the .building files
* add logic for pre-11.7 cuda package repo back in
* clean up and verify yarn config
* revert test_install_gpu_cuda_nvidia_with_spark_job cuda versions
* configure for use with JupyterLab
* 2.2 should use 12.6.3 (latest)
* Addressing review from cnauroth
gpu/install_gpu_driver.sh:
* use the same retry arguments in all calls to curl
* correct 12.3's driver and sub-version
* improve logic for pause as other workers perform build
* remove call to undefined clear_nvsmi_cache
* move closing "fi" to line of its own
* added comments for unclear logic
* removed commented code
* remove unused curl for latest driver version
gpu/test_gpu.py
* removed excess test
* added comment about numa node selection
* removed skips of rocky9 ; 2.2.44-rocky9 build succeeds
* reverting changes to presubmit.sh1 parent 0b3165c commit 87e2f91
File tree
6 files changed
+1527
-511
lines changed- gpu
- integration_tests
6 files changed
+1527
-511
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
15 | 15 | | |
16 | 16 | | |
17 | 17 | | |
18 | | - | |
19 | | - | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
20 | 22 | | |
21 | 23 | | |
22 | 24 | | |
23 | 25 | | |
24 | 26 | | |
25 | 27 | | |
26 | 28 | | |
27 | | - | |
28 | | - | |
| 29 | + | |
| 30 | + | |
29 | 31 | | |
30 | 32 | | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
31 | 37 | | |
32 | 38 | | |
33 | 39 | | |
| |||
0 commit comments