[WIP] Harden training script: credentials validation, error handling, security fixes#432
[WIP] Harden training script: credentials validation, error handling, security fixes#432EZoni wants to merge 2 commits into
Conversation
…ity fixes - Add `set -euo pipefail` for strict error handling - Validate registry and DB env files exist before sourcing; assert required variables are set - Use `--password-stdin` for podman-hpc login to avoid exposing password in process list - Add `curl -f` to fail on HTTP errors and `jq -e` to fail on null/false output - Quote variable references consistently
| @@ -1,4 +1,5 @@ | |||
| #!/bin/bash -l | |||
| set -euo pipefail | |||
There was a problem hiding this comment.
-e exits on error, -u catches unbound variables, and -o pipefail catches failures mid-pipeline.
| if [ ! -r "${REGISTRY_PROFILE}" ]; then | ||
| echo "Missing registry credentials file: ${REGISTRY_PROFILE}" >&2 | ||
| echo "Create it with REGISTRY_USER and REGISTRY_PASSWORD for ${REGISTRY_NAME}." >&2 | ||
| exit 1 | ||
| fi | ||
| source "${REGISTRY_PROFILE}" # credential variables: REGISTRY_USER and REGISTRY_PASSWORD | ||
| : "${REGISTRY_USER:?REGISTRY_USER must be set in ${REGISTRY_PROFILE}}" | ||
| : "${REGISTRY_PASSWORD:?REGISTRY_PASSWORD must be set in ${REGISTRY_PROFILE}}" | ||
|
|
||
| if [ ! -r "${DB_ENV_FILE}" ]; then | ||
| echo "Missing container environment file: ${DB_ENV_FILE}" >&2 | ||
| echo "Create it with SF_DB_READONLY_PASSWORD and AM_SC_API_KEY." >&2 | ||
| exit 1 | ||
| fi |
There was a problem hiding this comment.
Failing fast with a clear error message beats a cryptic "unbound variable" or silent auth failure later.
| exit 1 | ||
| fi | ||
|
|
||
| printf '%s\n' "${REGISTRY_PASSWORD}" | podman-hpc login --username "${REGISTRY_USER}" --password-stdin "${REGISTRY_NAME}" |
There was a problem hiding this comment.
Passwords passed as arguments are visible in ps output and shell history; reading from stdin avoids this. Using printf '%s\n' is also slightly better than echo since echo behavior with special characters can be shell-dependent.
| OCI_MANIFEST_TYPE="application/vnd.oci.image.manifest.v1+json" | ||
| # first we unwarp the provenance information to pick the right image manifest | ||
| MANIFEST_INDEX=$(curl -s \ | ||
| MANIFEST_INDEX=$(curl -fsS \ |
There was a problem hiding this comment.
-f makes curl return a non-zero exit code on HTTP errors (4xx/5xx), and -S shows error messages even in silent mode. Without -f, a 401 or 404 would silently return an error document and let the script continue.
| fi | ||
| # then we download the manifest of the image and parse out the config digest SHA | ||
| REMOTE_SHA=$(curl -s \ | ||
| REMOTE_SHA=$(curl -fsS \ |
There was a problem hiding this comment.
-f makes curl return a non-zero exit code on HTTP errors (4xx/5xx), and -S shows error messages even in silent mode. Without -f, a 401 or 404 would silently return an error document and let the script continue.
| -u "${REGISTRY_USER}:${REGISTRY_PASSWORD}" \ | ||
| "https://${REGISTRY_NAME}/v2/${IMAGE_NAME}/manifests/${MANIFEST_INDEX}" | \ | ||
| jq -r '.config.digest') | ||
| jq -er '.config.digest') |
There was a problem hiding this comment.
jq -er instead of -r adds -e, which makes jq exit non-zero if the output is false or null. Combined with set -e and curl -f, this closes a gap where a malformed or unexpected JSON response could let a null digest pass through and cause a misleading "IDs are different" message or worse.
| -v /etc/localtime:/etc/localtime \ | ||
| -v /global/cfs/cdirs/m558/superfacility/model_training/config.yaml:/app/ml/config.yaml \ | ||
| --env-file $HOME/db-podman.profile \ | ||
| --env-file "${DB_ENV_FILE}" \ |
There was a problem hiding this comment.
Quoting ${DB_ENV_FILE} in --env-file is a minor but correct fix for paths with spaces.
There was a problem hiding this comment.
Pull request overview
Hardens the ml/training_pm.sbatch Slurm training script by enabling strict bash mode, validating required credential/env files before use, and reducing credential exposure during registry login.
Changes:
- Enable strict error handling via
set -euo pipefail. - Validate registry credential and container env files exist; assert required registry variables are set.
- Improve robustness of registry/manifest fetching with
curl -f/jq -eand switch registry login to--password-stdin.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| set -euo pipefail | ||
|
|
| MANIFEST_INDEX=$(curl -fsS \ | ||
| -H "Accept: ${OCI_INDEX_TYPE}" \ | ||
| -u "${REGISTRY_USER}:${REGISTRY_PASSWORD}" \ | ||
| "https://${REGISTRY_NAME}/v2/${IMAGE_NAME}/manifests/${IMAGE_VERSION}" | \ |
There was a problem hiding this comment.
This was pre-existing and is unrelated to the changes in this PR.
| MANIFEST_INDEX=$(curl -fsS \ | ||
| -H "Accept: ${OCI_INDEX_TYPE}" \ | ||
| -u "${REGISTRY_USER}:${REGISTRY_PASSWORD}" \ | ||
| "https://${REGISTRY_NAME}/v2/${IMAGE_NAME}/manifests/${IMAGE_VERSION}" | \ | ||
| jq -r '.manifests[] | select(.annotations["vnd.docker.reference.type"] != "attestation-manifest") | .digest' || echo "NOOCI") | ||
| jq -er '.manifests[] | select(.annotations["vnd.docker.reference.type"] != "attestation-manifest") | .digest' || echo "NOOCI") | ||
| # fallback for old Docker builders without OCI index |
There was a problem hiding this comment.
The original .manifests[] would cause jq to error if the manifests key was absent from the JSON response. For example, when the registry returns a plain Docker V2 manifest instead of an OCI index. The // [] fallback means jq gracefully produces no output rather than erroring, which allows the || echo "NOOCI" fallback to trigger correctly and the script to continue to the Docker V2 code path.
| MANIFEST_INDEX=$(curl -fsS \ | |
| -H "Accept: ${OCI_INDEX_TYPE}" \ | |
| -u "${REGISTRY_USER}:${REGISTRY_PASSWORD}" \ | |
| "https://${REGISTRY_NAME}/v2/${IMAGE_NAME}/manifests/${IMAGE_VERSION}" | \ | |
| jq -r '.manifests[] | select(.annotations["vnd.docker.reference.type"] != "attestation-manifest") | .digest' || echo "NOOCI") | |
| jq -er '.manifests[] | select(.annotations["vnd.docker.reference.type"] != "attestation-manifest") | .digest' || echo "NOOCI") | |
| # fallback for old Docker builders without OCI index | |
| MANIFEST_INDEX=$(curl -fsS \ | |
| -H "Accept: ${OCI_INDEX_TYPE}" \ | |
| -u "${REGISTRY_USER}:${REGISTRY_PASSWORD}" \ | |
| "https://${REGISTRY_NAME}/v2/${IMAGE_NAME}/manifests/${IMAGE_VERSION}" | \ | |
| jq -er '(.manifests // [])[] | select(.annotations["vnd.docker.reference.type"] != "attestation-manifest") | .digest' || echo "NOOCI") | |
| # fallback for old Docker builders without OCI index |
| --rm -it ${REGISTRY_NAME}/${IMAGE_NAME}:${IMAGE_VERSION} \ | ||
| python -u /app/ml/train_model.py --config_file /app/ml/config.yaml --model ${model} |
Overview
set -euo pipefailfor strict error handling--password-stdinfor podman-hpc login to avoid exposing password in process listcurl -fto fail on HTTP errors andjq -eto fail on null/false outputMerge first
To do
Notes
Changes isolated from #428.