Skip to content

[WIP] Harden training script: credentials validation, error handling, security fixes#432

Open
EZoni wants to merge 2 commits into
BLAST-AI-ML:mainfrom
EZoni:harden_training_script
Open

[WIP] Harden training script: credentials validation, error handling, security fixes#432
EZoni wants to merge 2 commits into
BLAST-AI-ML:mainfrom
EZoni:harden_training_script

Conversation

@EZoni
Copy link
Copy Markdown
Member

@EZoni EZoni commented May 1, 2026

Overview

  • Add set -euo pipefail for strict error handling
  • Validate registry and DB env files exist before sourcing; assert required variables are set
  • Use --password-stdin for podman-hpc login to avoid exposing password in process list
  • Add curl -f to fail on HTTP errors and jq -e to fail on null/false output
  • Quote variable references consistently

Merge first

To do

  • Split features and bug fixes

Notes

Changes isolated from #428.

…ity fixes

- Add `set -euo pipefail` for strict error handling
- Validate registry and DB env files exist before sourcing; assert
  required variables are set
- Use `--password-stdin` for podman-hpc login to avoid exposing
  password in process list
- Add `curl -f` to fail on HTTP errors and `jq -e` to fail on
  null/false output
- Quote variable references consistently
@EZoni EZoni added the ml Changes related to the ML models label May 1, 2026
Comment thread ml/training_pm.sbatch Outdated
@@ -1,4 +1,5 @@
#!/bin/bash -l
set -euo pipefail
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-e exits on error, -u catches unbound variables, and -o pipefail catches failures mid-pipeline.

Comment thread ml/training_pm.sbatch
Comment on lines +30 to +43
if [ ! -r "${REGISTRY_PROFILE}" ]; then
echo "Missing registry credentials file: ${REGISTRY_PROFILE}" >&2
echo "Create it with REGISTRY_USER and REGISTRY_PASSWORD for ${REGISTRY_NAME}." >&2
exit 1
fi
source "${REGISTRY_PROFILE}" # credential variables: REGISTRY_USER and REGISTRY_PASSWORD
: "${REGISTRY_USER:?REGISTRY_USER must be set in ${REGISTRY_PROFILE}}"
: "${REGISTRY_PASSWORD:?REGISTRY_PASSWORD must be set in ${REGISTRY_PROFILE}}"

if [ ! -r "${DB_ENV_FILE}" ]; then
echo "Missing container environment file: ${DB_ENV_FILE}" >&2
echo "Create it with SF_DB_READONLY_PASSWORD and AM_SC_API_KEY." >&2
exit 1
fi
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failing fast with a clear error message beats a cryptic "unbound variable" or silent auth failure later.

Comment thread ml/training_pm.sbatch
exit 1
fi

printf '%s\n' "${REGISTRY_PASSWORD}" | podman-hpc login --username "${REGISTRY_USER}" --password-stdin "${REGISTRY_NAME}"
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passwords passed as arguments are visible in ps output and shell history; reading from stdin avoids this. Using printf '%s\n' is also slightly better than echo since echo behavior with special characters can be shell-dependent.

Comment thread ml/training_pm.sbatch
OCI_MANIFEST_TYPE="application/vnd.oci.image.manifest.v1+json"
# first we unwarp the provenance information to pick the right image manifest
MANIFEST_INDEX=$(curl -s \
MANIFEST_INDEX=$(curl -fsS \
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-f makes curl return a non-zero exit code on HTTP errors (4xx/5xx), and -S shows error messages even in silent mode. Without -f, a 401 or 404 would silently return an error document and let the script continue.

Comment thread ml/training_pm.sbatch
fi
# then we download the manifest of the image and parse out the config digest SHA
REMOTE_SHA=$(curl -s \
REMOTE_SHA=$(curl -fsS \
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-f makes curl return a non-zero exit code on HTTP errors (4xx/5xx), and -S shows error messages even in silent mode. Without -f, a 401 or 404 would silently return an error document and let the script continue.

Comment thread ml/training_pm.sbatch
-u "${REGISTRY_USER}:${REGISTRY_PASSWORD}" \
"https://${REGISTRY_NAME}/v2/${IMAGE_NAME}/manifests/${MANIFEST_INDEX}" | \
jq -r '.config.digest')
jq -er '.config.digest')
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jq -er instead of -r adds -e, which makes jq exit non-zero if the output is false or null. Combined with set -e and curl -f, this closes a gap where a malformed or unexpected JSON response could let a null digest pass through and cause a misleading "IDs are different" message or worse.

Comment thread ml/training_pm.sbatch
-v /etc/localtime:/etc/localtime \
-v /global/cfs/cdirs/m558/superfacility/model_training/config.yaml:/app/ml/config.yaml \
--env-file $HOME/db-podman.profile \
--env-file "${DB_ENV_FILE}" \
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quoting ${DB_ENV_FILE} in --env-file is a minor but correct fix for paths with spaces.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Hardens the ml/training_pm.sbatch Slurm training script by enabling strict bash mode, validating required credential/env files before use, and reducing credential exposure during registry login.

Changes:

  • Enable strict error handling via set -euo pipefail.
  • Validate registry credential and container env files exist; assert required registry variables are set.
  • Improve robustness of registry/manifest fetching with curl -f/jq -e and switch registry login to --password-stdin.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ml/training_pm.sbatch Outdated
Comment on lines 2 to 3
set -euo pipefail

Comment thread ml/training_pm.sbatch
Comment on lines +55 to 58
MANIFEST_INDEX=$(curl -fsS \
-H "Accept: ${OCI_INDEX_TYPE}" \
-u "${REGISTRY_USER}:${REGISTRY_PASSWORD}" \
"https://${REGISTRY_NAME}/v2/${IMAGE_NAME}/manifests/${IMAGE_VERSION}" | \
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was pre-existing and is unrelated to the changes in this PR.

Comment thread ml/training_pm.sbatch
Comment on lines +55 to 60
MANIFEST_INDEX=$(curl -fsS \
-H "Accept: ${OCI_INDEX_TYPE}" \
-u "${REGISTRY_USER}:${REGISTRY_PASSWORD}" \
"https://${REGISTRY_NAME}/v2/${IMAGE_NAME}/manifests/${IMAGE_VERSION}" | \
jq -r '.manifests[] | select(.annotations["vnd.docker.reference.type"] != "attestation-manifest") | .digest' || echo "NOOCI")
jq -er '.manifests[] | select(.annotations["vnd.docker.reference.type"] != "attestation-manifest") | .digest' || echo "NOOCI")
# fallback for old Docker builders without OCI index
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original .manifests[] would cause jq to error if the manifests key was absent from the JSON response. For example, when the registry returns a plain Docker V2 manifest instead of an OCI index. The // [] fallback means jq gracefully produces no output rather than erroring, which allows the || echo "NOOCI" fallback to trigger correctly and the script to continue to the Docker V2 code path.

Suggested change
MANIFEST_INDEX=$(curl -fsS \
-H "Accept: ${OCI_INDEX_TYPE}" \
-u "${REGISTRY_USER}:${REGISTRY_PASSWORD}" \
"https://${REGISTRY_NAME}/v2/${IMAGE_NAME}/manifests/${IMAGE_VERSION}" | \
jq -r '.manifests[] | select(.annotations["vnd.docker.reference.type"] != "attestation-manifest") | .digest' || echo "NOOCI")
jq -er '.manifests[] | select(.annotations["vnd.docker.reference.type"] != "attestation-manifest") | .digest' || echo "NOOCI")
# fallback for old Docker builders without OCI index
MANIFEST_INDEX=$(curl -fsS \
-H "Accept: ${OCI_INDEX_TYPE}" \
-u "${REGISTRY_USER}:${REGISTRY_PASSWORD}" \
"https://${REGISTRY_NAME}/v2/${IMAGE_NAME}/manifests/${IMAGE_VERSION}" | \
jq -er '(.manifests // [])[] | select(.annotations["vnd.docker.reference.type"] != "attestation-manifest") | .digest' || echo "NOOCI")
# fallback for old Docker builders without OCI index

Comment thread ml/training_pm.sbatch
Comment on lines 88 to 89
--rm -it ${REGISTRY_NAME}/${IMAGE_NAME}:${IMAGE_VERSION} \
python -u /app/ml/train_model.py --config_file /app/ml/config.yaml --model ${model}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ml Changes related to the ML models

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants