Problem
The current Helm chart treats the orchestrator and all provider workers as a single defaults block, so every worker pod gets the same RW mount on /ref. That is broader than the application actually requires, and it forces every worker to share a writable PVC with the orchestrator.
Actual access requirements
After tracing climate_ref.config.PathConfig and climate_ref_celery.worker_tasks:
| Path | API | Provider workers | Orchestrator | Migrate Job |
| --- | --- | --- | --- | --- |
| /ref (config TOML) | RO | RO | RO | RO |
| /ref/software | RO | RO | RW (ref providers setup writes conda envs) | - |
| /ref/scratch | - | RW (per-pod emptyDir is fine) | RW | - |
| /ref/results | RO | - | RW (handle_result task copies scratch -> results) | - |
| /ref/log | - | - | RW (currently unused) | - |
| /tmp (HOME) | RW | RW | RW | RW |

Provider workers run celery start-worker --provider X and consume only their provider queue. The orchestrator runs celery start-worker (no --provider flag) and is the only deployment that consumes the default celery queue, where handle_result performs the scratch -> results copy. So provider workers never touch /ref/results and never write /ref/software.
Why this matters
- Provider workers should not have RW access to the conda env tree (/ref/software); a buggy or compromised diagnostic could clobber other providers' environments.
- Workers should not need shared RW access to /ref/scratch; a per-pod emptyDir is enough because the orchestrator copies the artefacts out via handle_result before the pod is recycled.
- Today these constraints are not expressed in the chart, so users defaulting to defaults.volumes end up with a single shared RW PVC across every pod.
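
As a rough illustration of the first two points, the pod spec fragment below sketches how a provider worker could combine a read-only mount of /ref with a per-pod emptyDir at /ref/scratch. The volume names, the claim name (ref-data), the container name, and the image are placeholders, not values taken from the chart. The nested mount also assumes the scratch directory already exists on the PVC, so that the runtime has a mount point under the read-only tree.

```yaml
# Sketch only: a provider worker pod spec with RO /ref and a private scratch dir.
# All names and the image below are placeholders, not the chart's actual values.
spec:
  containers:
    - name: provider-worker              # hypothetical container name
      image: example/climate-ref:latest  # placeholder image
      volumeMounts:
        - name: ref
          mountPath: /ref
          readOnly: true                 # config TOML and conda envs cannot be modified
        - name: scratch
          mountPath: /ref/scratch        # writable, visible only to this pod
  volumes:
    - name: ref
      persistentVolumeClaim:
        claimName: ref-data              # assumed PVC name
        readOnly: true
    - name: scratch
      emptyDir: {}                       # removed together with the pod
```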
Proposed direction
- Surface the orchestrator as its own top-level chart block (or a sentinel under providers) with its own volumes / volumeMounts defaults rather than relying on the implicit providers.orchestrator entry.
- Set chart defaults so:
  - Orchestrator: RW /ref (or RW /ref/software, /ref/results, /ref/log plus RO config).
  - Provider workers: RO /ref + per-pod emptyDir for /ref/scratch.
  - API: RO /ref (already the recommended pattern).
- Update helm/ci/gh-actions-values.yaml and helm/local-test-values.yaml to match the new split, and document the contract in helm/README.md under "Required Volumes".
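
One possible shape for the split values is sketched below. It is not the chart's current schema: the top-level orchestrator key is the block this issue proposes, and the claim name (ref-data) and volume names are assumptions chosen for illustration. Only the defaults key and the volumes / volumeMounts fields are named in this issue.

```yaml
# Hypothetical values.yaml layout for the proposed split.
# The `orchestrator` key, claim name, and volume names are illustrative assumptions.
orchestrator:
  volumes:
    - name: ref
      persistentVolumeClaim:
        claimName: ref-data        # assumed PVC name
  volumeMounts:
    - name: ref
      mountPath: /ref              # RW: software, results, and log live under here

defaults:                          # applied to provider workers
  volumes:
    - name: ref
      persistentVolumeClaim:
        claimName: ref-data        # same claim, mounted read-only
        readOnly: true
    - name: scratch
      emptyDir: {}
  volumeMounts:
    - name: ref
      mountPath: /ref
      readOnly: true
    - name: scratch
      mountPath: /ref/scratch      # per-pod writable scratch space
```

Rendering the chart with helm template and inspecting the resulting volumeMounts for each deployment would be one way to confirm that only the orchestrator ends up with a writable /ref.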
Workaround until then
The looser layout (single shared RW PVC at /ref for all worker pods) still works and is what helm/README.md currently documents. This issue tracks tightening the model rather than a regression.
References
- climate_ref/config.py PathConfig: defines the /ref/{software,scratch,results,log} layout
- climate_ref_celery/worker_tasks.py:handle_result: orchestrator-only task that copies scratch -> results
- helm/templates/providers/deployment.yaml: currently includes orchestrator under the same range as provider workers