helm: tighten /ref volume split between orchestrator, workers, and API #8

@lewisjared

Problem

The current Helm chart configures the orchestrator and all provider workers from a single defaults block, so every worker pod gets the same RW mount on /ref. That is broader than the application requires, and it forces every worker to share a writable PVC with the orchestrator.

Actual access requirements

After tracing climate_ref.config.PathConfig and climate_ref_celery.worker_tasks:

| Path | API | Provider workers | Orchestrator | Migrate Job |
|---|---|---|---|---|
| /ref (config TOML) | RO | RO | RO | RO |
| /ref/software | RO | RO | RW (ref providers setup writes conda envs) | - |
| /ref/scratch | - | RW (per-pod emptyDir is fine) | RW | - |
| /ref/results | RO | - | RW (handle_result task copies scratch -> results) | - |
| /ref/log | - | - | RW (currently unused) | - |
| /tmp (HOME) | RW | RW | RW | RW |

Provider workers run celery start-worker --provider X and consume only their provider queue. The orchestrator runs celery start-worker (no --provider flag) and is the only deployment that consumes the default celery queue, where handle_result performs the scratch -> results copy. So provider workers never touch /ref/results and never write /ref/software.

Why this matters

  • Provider workers should not have RW access to the conda env tree (/ref/software); a buggy or compromised diagnostic could clobber other providers' environments.
  • Workers should not need shared RW access to /ref/scratch; a per-pod emptyDir is enough because the orchestrator copies the artefacts out via handle_result before the pod is recycled.
  • Today these constraints are not expressed in the chart, so users who rely on defaults.volumes end up with a single shared RW PVC across every pod.
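
As a rough illustration of these constraints at the Kubernetes level, a provider worker pod could mount the shared volume read-only and overlay a per-pod emptyDir on /ref/scratch. This is a hand-written pod spec fragment, not output rendered by the chart; the container name `worker` and the claim name `ref-data` are placeholders.

```yaml
# Fragment of a provider worker pod spec (not a complete manifest).
spec:
  containers:
    - name: worker                  # placeholder container name
      volumeMounts:
        - name: ref
          mountPath: /ref           # config TOML and /ref/software, read-only
          readOnly: true
        - name: scratch
          mountPath: /ref/scratch   # writable, private to this pod
  volumes:
    - name: ref
      persistentVolumeClaim:
        claimName: ref-data         # placeholder claim name
        readOnly: true
    - name: scratch
      emptyDir: {}                  # discarded when the pod is removed
```

With this layout a diagnostic running in the worker cannot write under /ref/software, and anything left in /ref/scratch disappears with the pod.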

Proposed direction

  1. Surface the orchestrator as its own top-level chart block (or a sentinel under providers) with its own volumes / volumeMounts defaults rather than relying on the implicit providers.orchestrator entry.
  2. Set chart defaults so:
    • Orchestrator: RW /ref (or RW /ref/software, /ref/results, /ref/log plus RO config).
    • Provider workers: RO /ref + per-pod emptyDir for /ref/scratch.
    • API: RO /ref (already the recommended pattern).
  3. Update helm/ci/gh-actions-values.yaml and helm/local-test-values.yaml to match the new split, and document the contract in helm/README.md under "Required Volumes".
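
A minimal sketch of how the split in step 2 might look in a values file. Only `defaults.volumes` and `volumeMounts` are named in this issue; the top-level `orchestrator` and `api` keys, the volume names, and the claim name `ref-data` are assumptions for illustration and may not match the final chart schema.

```yaml
# Hypothetical values layout; key names other than defaults.volumes
# and volumeMounts are assumptions, not the chart's current schema.
defaults:                       # applied to provider workers
  volumes:
    - name: ref
      persistentVolumeClaim:
        claimName: ref-data     # placeholder claim name
        readOnly: true
    - name: scratch
      emptyDir: {}
  volumeMounts:
    - name: ref
      mountPath: /ref
      readOnly: true
    - name: scratch
      mountPath: /ref/scratch   # per-pod, writable

orchestrator:                   # proposed dedicated block
  volumes:
    - name: ref
      persistentVolumeClaim:
        claimName: ref-data
  volumeMounts:
    - name: ref
      mountPath: /ref           # RW: software, results, log

api:                            # assumed key for the API deployment
  volumes:
    - name: ref
      persistentVolumeClaim:
        claimName: ref-data
        readOnly: true
  volumeMounts:
    - name: ref
      mountPath: /ref
      readOnly: true
```

Sharing one claim between a writable orchestrator mount and read-only mounts in other pods generally needs an access mode that allows multi-node attachment (for example ReadWriteMany) unless all pods land on the same node, so the storage class is worth checking before adopting these defaults.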

Workaround until then

The looser layout (single shared RW PVC at /ref for all worker pods) still works and is what helm/README.md currently documents. This issue tracks a tightening of the model, not a regression.
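
For comparison, the looser layout amounts to something like the following values, where every pod inherits one writable mount. This is a sketch under the assumption that `defaults` accepts standard Kubernetes `volumes` and `volumeMounts` entries; the claim name `ref-data` is a placeholder and the exact keys in helm/README.md may differ.

```yaml
# Sketch of the current single-PVC layout (placeholder names).
defaults:
  volumes:
    - name: ref
      persistentVolumeClaim:
        claimName: ref-data     # placeholder claim name
  volumeMounts:
    - name: ref
      mountPath: /ref           # read-write in every pod that uses defaults
```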

References

  • climate_ref/config.py PathConfig — defines the /ref/{software,scratch,results,log} layout
  • climate_ref_celery/worker_tasks.py:handle_result — orchestrator-only task that copies scratch -> results
  • helm/templates/providers/deployment.yaml — currently includes orchestrator under the same range as provider workers
