This guide covers using the raw run:ai CLI without the csub.py wrapper.
**Note:** Recommended for most users: the `csub.py` + `.env` workflow described in the main README.
This guide is for advanced users who:
- Prefer to drive everything through the raw run:ai CLI
- Want to use Thijs' base images directly without `csub.py`
- Need more granular control over job submission
Thijs created several base images with common packages pre-installed:
| Image | Includes |
|---|---|
| `mlo/basic` | numpy, jupyter, common utilities |
| `mlo/pytorch` | basic + computer vision + PyTorch |
| `mlo/jax` | basic + computer vision + JAX |
| `mlo/tensorflow` | basic + computer vision + TensorFlow |
| `mlo/latex` | basic + texlive (for LaTeX documents) |
Registry: `ic-registry.epfl.ch/mlo/<image>:latest`
To update these images:
- Clone https://github.com/epfml/mlocluster-setup
- Navigate to `docker-images/`
- Run `./publish.sh`

- Quick start: Follow a Docker tutorial
- MLO integration: See Architecture: Images & Publishing
Interactive sessions are ideal for development, debugging, and exploratory work.
```bash
runai submit \
  --name sandbox \
  --interactive \
  --gpu 1 \
  --image ic-registry.epfl.ch/mlo/pytorch:latest \
  --pvc runai-mlo-$GASPAR_USERNAME-scratch:/mloscratch \
  --large-shm --host-ipc \
  --environment EPFML_LDAP=$GASPAR_USERNAME \
  --command -- /entrypoint.sh sleep infinity
```

Flag explanations:
- `--interactive`: marks the job as interactive (1 GPU limit, higher priority, 12h max runtime)
- `--gpu 1`: request one GPU
- `--image`: Docker image to use
- `--pvc`: mount scratch storage
- `--large-shm --host-ipc`: shared-memory optimization flags
- `--environment`: pass environment variables into the pod
- `--command`: keep the pod running indefinitely (`sleep infinity`)
Monitor status (can take up to 10 minutes):

```bash
runai describe job sandbox
```

Wait until the status shows RUNNING.
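Rather than re-running `runai describe job` by hand, a small polling helper can wait for the job to come up. A hedged sketch: it assumes the job state appears verbatim as an upper-case word (`PENDING`, `RUNNING`, `FAILED`) in the `describe` output; adjust the pattern to your runai version.

```shell
# Sketch: poll `runai describe job <name>` until it reports RUNNING.
# Assumes the state appears verbatim in the output; check your runai version.
wait_for_running() {
  job="$1"
  while true; do
    state=$(runai describe job "$job" 2>/dev/null | grep -oE 'RUNNING|PENDING|FAILED' | head -n 1)
    case "$state" in
      RUNNING) return 0 ;;
      FAILED)  return 1 ;;
      *)       sleep 15 ;;   # still pending/scheduling; poll again
    esac
  done
}
# Usage: wait_for_running sandbox && echo "ready"
```

This blocks until the job is schedulable, so it is handy at the top of scripts that `exec` into the pod right after submission.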
```bash
runai exec sandbox -it -- su $GASPAR_USERNAME
```

Why `su $GASPAR_USERNAME`?
- Gives you a shell under your own user account
- Enables access to network storage (`/mloscratch`)
- The root user cannot access `/mloscratch` due to NFS permissions
**Important:** Compatibility note: these base images are not plug-and-play compatible with the `csub.py` workflow.

- `csub.py` workflow: uses the `NB_UID` and `NB_GID` environment variables and `docker/entrypoint.sh` to mirror your Gaspar identity
- Thijs' images: use a different layout (root + separate user)

You can:
- Use these images with the raw CLI (as shown here)
- Adapt them to the new entrypoint/UID model for `csub.py` compatibility
Training jobs are for actual experiments and long-running workloads.
```bash
runai submit \
  --name experiment-hyperparams-1 \
  --gpu 1 \
  --image ic-registry.epfl.ch/mlo/pytorch:latest \
  --pvc runai-mlo-$GASPAR_USERNAME-scratch:/mloscratch \
  --large-shm --host-ipc \
  --environment EPFML_LDAP=$GASPAR_USERNAME \
  --environment LEARNING_RATE=0.5 \
  --environment OPTIMIZER=Adam \
  --command -- /entrypoint.sh su $GASPAR_USERNAME -c 'cd /mloscratch/homes/$GASPAR_USERNAME/code && python train.py'
```

Key differences from interactive jobs:
- No `--interactive` flag: runs as a training workload
- Custom command: executes your training script
- Environment variables: pass hyperparameters as env vars
- Multiple GPUs: can request more than one GPU (up to 8)
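Because hyperparameters travel as environment variables, a sweep is just a loop over `runai submit` calls. A minimal sketch, assuming `$GASPAR_USERNAME` is set; the name sanitization replaces dots, since run:ai job names must be DNS-safe.

```shell
# Sketch: submit one training job per learning rate; dots in the value
# are replaced with dashes so the job name stays valid.
submit_sweep() {
  for lr in "$@"; do
    name="experiment-lr-${lr//./-}"
    runai submit \
      --name "$name" \
      --gpu 1 \
      --image ic-registry.epfl.ch/mlo/pytorch:latest \
      --pvc runai-mlo-$GASPAR_USERNAME-scratch:/mloscratch \
      --large-shm --host-ipc \
      --environment EPFML_LDAP=$GASPAR_USERNAME \
      --environment LEARNING_RATE="$lr" \
      --command -- /entrypoint.sh su $GASPAR_USERNAME -c \
        'cd /mloscratch/homes/$GASPAR_USERNAME/code && python train.py'
  done
}
# Usage: submit_sweep 0.1 0.5 1.0
```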
See this minimal CIFAR example with W&B logging.
**Important:** Job preemption: your job can be killed at any time if run:ai needs the resources for other users.
Always implement:
- Checkpointing (save model state regularly)
- Recovery logic (resume from checkpoint)
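One way to wire the recovery step in shell, as a hedged sketch: the `--resume` flag and the `*.pt` checkpoint naming are assumptions about your training script, not a fixed convention.

```shell
# Sketch: resume from the newest checkpoint if one exists, else start fresh.
# Assumes checkpoints are written as *.pt files; --resume is a hypothetical flag.
resume_training() {
  ckpt_dir="$1"
  latest=$(ls -t "$ckpt_dir"/*.pt 2>/dev/null | head -n 1)
  if [ -n "$latest" ]; then
    python train.py --resume "$latest"
  else
    python train.py
  fi
}
# Usage: resume_training /mloscratch/homes/$GASPAR_USERNAME/checkpoints
```

Using a wrapper like this as the job's `--command` means a preempted-and-restarted job picks up where it left off instead of starting over.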
See Managing Workflows for more best practices.
**Raw run:ai CLI**

Advantages:
- ✅ Full control over submission parameters
- ✅ Can use any Docker image
- ✅ No dependency on the `csub.py` script
- ✅ Easier to script custom workflows

Disadvantages:
- ❌ No automatic secret management (manual `kubectl create secret`)
- ❌ More verbose commands
- ❌ Manual UID/GID configuration needed
- ❌ Must manually sync SSH keys and tokens
**csub.py + .env workflow**

Advantages:
- ✅ Automatic secret management from `.env`
- ✅ Consistent UID/GID mapping
- ✅ Auto-sync of SSH keys and tokens
- ✅ Shorter, cleaner commands
- ✅ Less error-prone

Disadvantages:
- ❌ Less flexibility for custom setups
- ❌ Requires a compatible Docker image with the expected entrypoint
```bash
runai submit \
  --name dev-cpu \
  --interactive \
  --cpu 4 \
  --memory 16G \
  --image ic-registry.epfl.ch/mlo/pytorch:latest \
  --pvc runai-mlo-$GASPAR_USERNAME-scratch:/mloscratch \
  --environment EPFML_LDAP=$GASPAR_USERNAME \
  --command -- /entrypoint.sh sleep infinity
```

Cost: ~3 CHF/month
```bash
runai submit \
  --name multi-gpu-experiment \
  --gpu 4 \
  --image ic-registry.epfl.ch/mlo/pytorch:latest \
  --pvc runai-mlo-$GASPAR_USERNAME-scratch:/mloscratch \
  --large-shm --host-ipc \
  --environment EPFML_LDAP=$GASPAR_USERNAME \
  --command -- /entrypoint.sh su $GASPAR_USERNAME -c 'cd code && python train.py'
```

```bash
runai submit \
  --name jupyter \
  --interactive \
  --gpu 1 \
  --image ic-registry.epfl.ch/mlo/pytorch:latest \
  --pvc runai-mlo-$GASPAR_USERNAME-scratch:/mloscratch \
  --port 8888:8888 \
  --environment EPFML_LDAP=$GASPAR_USERNAME \
  --command -- /entrypoint.sh su $GASPAR_USERNAME -c 'jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser'
```

Then forward locally:
```bash
kubectl port-forward <pod-name> 8888:8888
```

- Main README: Getting Started Guide
- Architecture: Deep Dive
- Managing Workflows: Daily Operations
- Distributed Training: Multi-node Guide
- run:ai Docs: https://docs.run.ai