CTM-397: Replace GKE/Helm Galaxy path with GCE VM-based deployment #4904

Open
LizBaldo wants to merge 28 commits into develop from CTM-397-deploy-galaxy-on-GCE
Conversation

@LizBaldo LizBaldo commented Apr 7, 2026

Collaborator

Summary

Replaces the GKE/Helm-based Galaxy deployment path with a GCE VM-based deployment using galaxy-k8s-boot. Rather than provisioning a full GKE cluster and deploying Galaxy via Helm, Leo now creates a single GCE VM that runs Galaxy via Ansible/microk8s.

What changed

Galaxy VM provisioning (GKEInterpreter.installGalaxyVm)

  • Creates a GCE VM with a boot disk, data disk, and PostgreSQL disk
  • Passes galaxy-user-email GCE metadata (from app.auditInfo.creator) so the workspace user becomes the Galaxy admin
  • Gets or creates a galaxy-batch-runner service account in the user's project
  • Grants the pet SA roles/batch.jobsEditor at the project level and roles/iam.serviceAccountUser on the Batch SA, so the VM can submit GCP Batch jobs
  • Creates an NFS firewall rule (leonardo-galaxy-allow-nfs-for-batch) so Batch VMs can reach the Galaxy VM's NFS server (TCP/UDP 2049 and 111)
  • Polls the instance until it has an external IP; stores it as loadBalancerIp so the Leo proxy can reach the VM across VPC boundaries

Network topology and IP choice

  • Leo's GKE cluster is in Leo's GCP project; Galaxy VMs are created in the user's workspace project. The two VPCs are not peered, so the VM's internal IP (10.x.x.x) is not routable from Leo's pod
  • The leonardo-allow-http firewall rule (TCP port 80, source 0.0.0.0/0, targeting VMs with the leonardo network tag) allows Leo to reach the VM's external IP. Galaxy VMs are created with the leonardo tag
  • Both the readiness health check (isVmReachable) and the Akka HTTP proxy use the external IP

⚠️ Security note: leonardo-allow-http currently allows port 80 from 0.0.0.0/0, meaning the Galaxy VM is reachable directly from the internet, bypassing Leo's workspace-level authorization. This is tracked as a follow-up — options include restricting source ranges to Leo's GKE node CIDR, a GCP service-account-based firewall rule, or a shared-secret header enforced by Galaxy's nginx. See PR discussion for details.

Galaxy VM readiness health check

  • isProxyAvailable routes through Leo's own proxy hostname; in BEE environments the wildcard DNS resolves to the ingress controller's external IP, unreachable via hairpin NAT from within the GKE pod → TCP timeout
  • Added AppDAO.isVmReachable(ip, port) — a direct http4s HTTP GET to http://<externalIp>:80/. No proxy hostname resolution required. MockAppDAO returns IO.pure(isUp) for tests

Leo proxy: HTTP support for Galaxy VM backends

  • Added useHttp: Boolean = false to HostReady — when true the proxy connects via plain HTTP port 80 (ws:// for WebSocket) instead of HTTPS port 443
  • KubernetesDnsCache sets useHttp = true for AppType.Galaxy apps and maps the fake proxy hostname to the VM's external IP
  • ProxyService.handleHttpRequest / handleWebSocketRequest branch on useHttp; all non-Galaxy backends are unchanged
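
For reviewers, a minimal sketch of how a `useHttp` flag can drive scheme and port selection. `BackendTarget` and its helper methods are hypothetical names used only for illustration; only the `useHttp: Boolean = false` default, ports 80 and 443, and the `ws://` scheme come from the description above. The `wss://` counterpart for the HTTPS case is an assumption.

```scala
// Hypothetical stand-in for the proxy's resolved backend; not Leo's actual HostReady.
final case class BackendTarget(host: String, useHttp: Boolean = false) {
  // Plain HTTP backends listen on 80; everything else stays on 443.
  def port: Int = if (useHttp) 80 else 443

  def httpScheme: String = if (useHttp) "http" else "https"

  // Assumption: the secure WebSocket scheme (wss) pairs with HTTPS backends.
  def wsScheme: String = if (useHttp) "ws" else "wss"

  def httpUri(path: String): String = s"$httpScheme://$host:$port$path"

  def wsUri(path: String): String = s"$wsScheme://$host:$port$path"
}
```

Because the flag defaults to `false`, a caller that never sets it keeps the HTTPS behaviour, which matches the claim that non-Galaxy backends are unchanged.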

Leo proxy: path handling for Galaxy VM (ProxyService.proxyAppRequest)

  • Leo forwards the full path (e.g. /proxy/google/v1/apps/{project}/{app}/galaxy/...) to the Galaxy VM unchanged
  • galaxy-k8s-boot's ansible playbook configures Galaxy's nginx ingress.path to the value of galaxy_prefix, which Leo passes as a GCE metadata item (galaxy-url-prefix). Galaxy's nginx therefore serves at the full Leo proxy path, so all requests route correctly without any path rewriting in Leo

NFS PVC size: GB → GiB conversion fix (GKEInterpreter.installGalaxyVm)

  • pvSizeGi = nfsDisk.size.gb - 11 treated decimal GB as binary GiB. For a 500 GB disk: the disk holds ~466 GiB but Leo requested 489 GiB → NFS provisioner fails with insufficient available space, leaving all Galaxy pods Pending
  • Fixed to convert first: diskSizeGiB = (nfsDisk.size.gb.toLong * 1000^3) / 1024^3, then subtract 11 GiB overhead
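
The arithmetic behind this fix can be sketched as follows. `NfsPvcSizing` and its member names are hypothetical; only the 11 GiB overhead and the convert-then-subtract order come from the description. Note that `^` is bitwise XOR in Scala, so the powers are written out as products.

```scala
// Illustrative sketch of the GB -> GiB conversion; not the actual GKEInterpreter code.
object NfsPvcSizing {
  private val BytesPerGB: Long  = 1000L * 1000L * 1000L // decimal gigabyte
  private val BytesPerGiB: Long = 1024L * 1024L * 1024L // binary gibibyte
  val OverheadGiB: Long = 11L

  // Whole GiB that fit on a disk whose size is given in decimal GB (rounds down).
  def diskSizeGiB(sizeGb: Int): Long = (sizeGb.toLong * BytesPerGB) / BytesPerGiB

  // PVC size to request: convert first, then subtract the overhead.
  def pvSizeGi(sizeGb: Int): Long = diskSizeGiB(sizeGb) - OverheadGiB
}
```

For a 500 GB disk this yields 465 GiB of capacity and a 454 GiB request, instead of the 489 GiB the old formula asked for.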

Lifecycle: restore from existing disks

  • restore = msg.appType == AppType.Galaxy && msg.createDisk.isEmpty: when a Galaxy app is created without a new disk, the disks already exist (prior app was deleted keeping disks)
  • In restore mode: skips creating the PostgreSQL disk; passes restore_galaxy=true metadata to Ansible
  • CreateAppParams.restore: Boolean propagates the flag through to installGalaxyVm
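
A rough sketch of the restore decision, assuming simplified message types. `SketchAppType`, `SketchCreateDisk`, `SketchCreateAppMessage` and `RestoreSketch` are hypothetical stand-ins for Leo's real types; the predicate itself and the `restore_galaxy=true` metadata item are taken from the bullets above.

```scala
// Hypothetical, simplified stand-ins for Leo's message types.
sealed trait SketchAppType
object SketchAppType {
  case object Galaxy extends SketchAppType
  case object Other extends SketchAppType
}

final case class SketchCreateDisk(name: String)

final case class SketchCreateAppMessage(appType: SketchAppType, createDisk: Option[SketchCreateDisk])

object RestoreSketch {
  // Restore mode: a Galaxy app created without a new disk reuses existing disks.
  def isRestore(msg: SketchCreateAppMessage): Boolean =
    msg.appType == SketchAppType.Galaxy && msg.createDisk.isEmpty

  // Extra GCE metadata handed to Ansible only in restore mode.
  def restoreMetadata(restore: Boolean): Map[String, String] =
    if (restore) Map("restore_galaxy" -> "true") else Map.empty
}
```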

Lifecycle: delete keeping disks

  • DeleteAppMessage(diskId = None) → VM is deleted, both disks are preserved
  • DeleteAppMessage(diskId = Some(...)) → VM + both disks deleted
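
The two delete paths can be sketched as a single pattern match. `SketchDeleteAppMessage`, `DeletePlan` and `DeletePlanSketch` are hypothetical names; the real `DeleteAppMessage` carries more fields than the `diskId` shown here.

```scala
// Hypothetical, reduced version of the delete message: only diskId matters here.
final case class SketchDeleteAppMessage(diskId: Option[Long])

// What the delete handler should remove.
final case class DeletePlan(deleteVm: Boolean, deleteDisks: Boolean)

object DeletePlanSketch {
  def plan(msg: SketchDeleteAppMessage): DeletePlan = msg.diskId match {
    case Some(_) => DeletePlan(deleteVm = true, deleteDisks = true)  // VM and both disks go
    case None    => DeletePlan(deleteVm = true, deleteDisks = false) // disks are preserved
  }
}
```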

Config cleanup

  • Removed gcpBatchServiceAccountEmail from reference.conf / GalaxyVmConfig / Config.scala — Leo now creates the SA dynamically via getOrCreateServiceAccount instead of relying on a pre-configured email

Architecture notes

| Property | GKE Galaxy (before) | GCE VM Galaxy (now) |
|---|---|---|
| Backend | GKE cluster + Helm | Single GCE VM (galaxy-k8s-boot anvil branch) |
| VM bootstrap | N/A | GCE user-data cloud-init via guest agent |
| Proxy backend protocol | HTTPS port 443 (nginx ingress TLS) | HTTP port 80 (nginx reverse proxy) |
| Backend IP | Ingress load balancer IP (external) | VM external IP |
| Proxy reachability | Via GKE LoadBalancer service IP | Via leonardo-allow-http firewall (0.0.0.0/0 → port 80, leonardo tag) |
| Readiness check | isProxyAvailable via proxy hostname | isVmReachable direct HTTP to external IP |
| Batch jobs | N/A | GCP Batch via galaxy-batch-runner SA |

Security comparison: old GKE-based vs. new VM-based

| Property | Old (GKE-based) | New (VM-based) |
|---|---|---|
| Protocol | HTTPS (mTLS) | Plain HTTP |
| Target IP | GKE load balancer (external) | VM external IP |
| Port exposed | 443 | 80 |
| Firewall source range | 0.0.0.0/0 | 0.0.0.0/0 |
| Certificate validation | Yes (Leo-issued cert on nginx) | None |

Regressions introduced:

  1. No encryption — traffic between Leo's pod and the Galaxy VM crosses the public internet in plaintext
  2. Direct VM access — port 80 is open to 0.0.0.0/0, so anyone who discovers the VM's external IP can reach Galaxy directly, bypassing Leo's authentication

Alternatives for follow-up:

  • Option A — VPC peering / Private Service Connect (recommended structural fix): peer Leo's project VPC with the user's workspace VPC so Leo can reach the VM on its internal IP; the 0.0.0.0/0 firewall rule is no longer needed
  • Option B — HTTPS on the Galaxy VM (closest to old security posture): configure nginx on the Galaxy VM with Leo's CA certificate (as Jupyter VMs do); flip useHttp = false for Galaxy; requires provisioning Leo certs onto the VM during installGalaxyVm
  • Option C — GCP IAP or Cloud Armor (lower-effort mitigation): restricts direct VM access without VPC changes, but does not encrypt the Leo→VM leg

Test plan

  • Unit tests pass (GKEInterpreterSpec, LeoPubsubMessageSubscriberSpec)
  • Scala formatting clean
  • BEE: create Galaxy app → VM boots, Ansible runs, status goes to Running
  • BEE: access Galaxy through Leo proxy URL
  • BEE: verify workspace user email is the Galaxy admin (not a hardcoded address)
  • BEE: delete app keeping disks → VM deleted, disks remain
  • BEE: re-create app from existing disks → restore_galaxy=true passed to Ansible, Galaxy restores state
  • Follow-up: restrict leonardo-allow-http source range from 0.0.0.0/0 to Leo's GKE node CIDR

@LizBaldo LizBaldo requested a review from a team as a code owner April 7, 2026 17:39
Liz Baldo and others added 4 commits April 7, 2026 13:51
At Galaxy VM creation, grant the pet SA roles/batch.jobsEditor on the
user's project so it can submit and monitor GCP Batch jobs. When the
Batch SA lives in the same project, also grant serviceAccountUser on it;
cross-project Batch SAs must have that binding configured externally.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codecov

codecov Bot commented Apr 8, 2026


Codecov Report

❌ Patch coverage is 80.41237% with 38 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.86%. Comparing base (5b339ea) to head (8fb089b).
⚠️ Report is 12 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../dsde/workbench/leonardo/util/GKEInterpreter.scala | 85.43% | 22 Missing ⚠️ |
| ...itute/dsde/workbench/leonardo/dao/HttpAppDAO.scala | 0.00% | 9 Missing ⚠️ |
| ...de/workbench/leonardo/dns/KubernetesDnsCache.scala | 0.00% | 3 Missing ⚠️ |
| ...e/dsde/workbench/leonardo/dao/HttpJupyterDAO.scala | 0.00% | 2 Missing ⚠️ |
| ...workbench/leonardo/http/service/ProxyService.scala | 80.00% | 2 Missing ⚠️ |
Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #4904      +/-   ##
===========================================
- Coverage    74.08%   73.86%   -0.22%     
===========================================
  Files          131      131              
  Lines        11100    11166      +66     
  Branches       895      901       +6     
===========================================
+ Hits          8223     8248      +25     
- Misses        2877     2918      +41     
| Files with missing lines | Coverage Δ |
|---|---|
| ...titute/dsde/workbench/leonardo/config/Config.scala | 97.79% <100.00%> (+0.03%) ⬆️ |
| ...orkbench/leonardo/config/KubernetesAppConfig.scala | 95.00% <ø> (ø) |
| ...stitute/dsde/workbench/leonardo/dao/ProxyDAO.scala | 25.00% <ø> (ø) |
| ...orkbench/leonardo/db/PersistentDiskComponent.scala | 97.29% <100.00%> (-2.02%) ⬇️ |
| ...ch/leonardo/http/service/LeoAppServiceInterp.scala | 84.34% <ø> (ø) |
| .../leonardo/monitor/LeoPubsubMessageSubscriber.scala | 76.80% <100.00%> (-0.20%) ⬇️ |
| ...workbench/leonardo/util/BuildHelmChartValues.scala | 97.87% <ø> (-0.47%) ⬇️ |
| ...tute/dsde/workbench/leonardo/util/GKEAlgebra.scala | 80.00% <ø> (-20.00%) ⬇️ |
| ...e/dsde/workbench/leonardo/dao/HttpJupyterDAO.scala | 19.44% <0.00%> (-0.56%) ⬇️ |
| ...workbench/leonardo/http/service/ProxyService.scala | 77.00% <80.00%> (-0.56%) ⬇️ |

... and 3 more

... and 2 files with indirect coverage changes


Continue to review full report in Codecov by Harness.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b339ea...8fb089b. Read the comment docs.

@LizBaldo LizBaldo requested a review from lucymcnatt April 8, 2026 15:58
@LizBaldo LizBaldo requested a review from afgane April 13, 2026 13:00
@LizBaldo

Collaborator Author

I am currently blocked from testing further because I lack permission on the Galaxy image used for the boot VM:
Required 'compute.images.useReadOnly' permission for 'projects/anvil-and-terra-development/global/images/galaxy-k8s-boot-v2026-02-25'

I think I would need the Galaxy team to make the image readable by allAuthenticatedUsers in the anvil-and-terra-development project

@LizBaldo

Collaborator Author

I am currently blocked from testing further because of a lack of permission on the galaxy image to use on the boot VM: Required 'compute.images.useReadOnly' permission for 'projects/anvil-and-terra-development/global/images/galaxy-k8s-boot-v2026-02-25'

I think I would need the Galaxy team to make the image allAuthenticatedUsers in the anvil-and-terra-development project

The Galaxy team made the image public so I am currently unblocked :)

@aednichols

Contributor

the Galaxy VM is reachable directly from the internet

I'm not sure this is a concern, don't the instances live in a VPC with NAT?

@LizBaldo

Collaborator Author

the Galaxy VM is reachable directly from the internet

I'm not sure this is a concern, don't the instances live in a VPC with NAT?

I agree, but this is a departure from how we used to handle it. I am not sure the compliance review covered this, so I want to triple-check before merging.

Liz Baldo and others added 5 commits May 8, 2026 13:21
…, gcpBatchSaProject parsing

- Pass app.auditInfo.creator as galaxy-user-email GCE metadata so the
  actual workspace user (not dev@galaxyproject.org) becomes the Galaxy admin
- Fix scala.io.Source resource leak in installGalaxyVm using scala.util.Using
- Update sourceImage to galaxy-k8s-boot-v2026-06-10 and gitBranch to "anvil"
- Fix HOST_IP to use GCE metadata server instead of external ifconfig.me
- Fix gcpBatchSaProject SA email parsing: lift(1) + stripSuffix instead of
  lastOption + replace to avoid matching suffix in unexpected positions
- Correct stale comments: galaxy_url_prefix → galaxy_prefix, dev → anvil branch,
  wrong "internal IP" comment corrected to "external IP"
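
A small sketch of the `lift(1)` + `stripSuffix` parsing described in this commit, assuming the usual `name@project.iam.gserviceaccount.com` service account email shape. `BatchSaProjectSketch` and `projectOf` are hypothetical names, not the identifiers used in the PR.

```scala
// Illustrative only: extracts the project ID from a service account email.
object BatchSaProjectSketch {
  private val Suffix = ".iam.gserviceaccount.com"

  def projectOf(saEmail: String): Option[String] =
    saEmail
      .split("@")            // Array(name, domain) for a well-formed email
      .lift(1)               // None when there is no '@' at all
      .map(_.stripSuffix(Suffix)) // removes the suffix only at the very end
}
```

Unlike `replace`, `stripSuffix` leaves the domain untouched when the suffix appears anywhere other than the end, which is the failure mode the commit message calls out.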

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The pre-baked galaxy-k8s-boot image carries cloud-init state from its build,
so cloud-init treats new VM launches as subsequent boots and skips runcmd —
causing "No startup scripts to run" in Guest Agent logs and a VM that never
bootstraps Galaxy.

Fix: pass the bootstrap script as the "startup-script" metadata key instead
of "user-data". The GCE Guest Agent always executes startup-script on boot,
regardless of cloud-init state.

galaxy-user-data.sh is reformatted from cloud-config YAML to a plain bash
script. The sudo -u debian block now uses a single-quoted heredoc delimiter
(<<'DEBIAN_EOF') to avoid apostrophes in comments breaking shell quoting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n playbook.yml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ty with anvil playbook (helm list --all)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Liz Baldo and others added 6 commits June 15, 2026 11:18
…th check

Two bugs found during BEE testing:

1. Blank page on Galaxy load: the anvil playbook's galaxy_prefix only
   configures the nginx ingress path, not Galaxy's own galaxy_url_prefix
   in galaxy.yml. Without it Galaxy generates /static/... links that
   resolve without the Leo proxy prefix, leaving a blank page. Fixed by
   passing galaxy_helm_extra_sets via an extra-vars YAML file so Helm
   also sets configs.galaxy\.yml.galaxy_url_prefix.

2. App marked Running too early: isVmReachable was polling GET / which
   returns 404 from nginx (no ingress rule at root) — 404 < 500 = true —
   so Leo marked the app Running before Galaxy pods were ready. Changed
   to poll the galaxy prefix path and require status < 400 so 404 (no
   ingress yet) and 502 (pods starting) both keep polling.
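
The change in the readiness predicate can be sketched as two pure functions. `ReadinessSketch`, `oldIsReady` and `newIsReady` are hypothetical names; the thresholds (`< 500` before, `< 400` after) are the ones stated in this commit message.

```scala
// Illustrative predicates over an HTTP status code; not the actual isVmReachable code.
object ReadinessSketch {
  // Before: any non-5xx response counted as reachable, so nginx's 404 passed.
  def oldIsReady(status: Int): Boolean = status < 500

  // After: 4xx and 5xx both mean "keep polling".
  def newIsReady(status: Int): Boolean = status < 400
}
```

With the stricter predicate, 404 (no ingress yet) and 502 (pods starting) both keep the poll loop going, while 2xx and 3xx responses still count as ready.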

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…is literal

YAML double-quoted strings treat \ as escape, making \.yml invalid.
Single-quoted YAML scalars treat backslash as literal, which is what
Helm needs for the dotted key configs.galaxy\.yml.galaxy_url_prefix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The cloudve/galaxy chart stores Galaxy config under configs.galaxy.yml.galaxy.*
so the correct key is configs.galaxy\.yml.galaxy.galaxy_url_prefix, not
configs.galaxy\.yml.galaxy_url_prefix. The previous key wrote the value at
the wrong level of galaxy.yml where Galaxy doesn't read it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ss probe

Three changes:
- ProxyService: add Authorization to HeadersToFilter so Leo no longer
  forwards the Terra Bearer JWT to backends. Galaxy 23.1+ treats any
  Authorization: Bearer value as a Galaxy API key; the Terra JWT fails
  validation and every API call returns 400, causing a blank page.
  Leo is the auth boundary — backends should not receive user tokens.
- GKEInterpreter: poll /api/version instead of the bare galaxy prefix
  path for the VM readiness check. The bare path can return a 301
  redirect (nginx trailing-slash normalisation) before Galaxy's Python
  backend is ready, satisfying < 400 and marking Running too early.
- galaxy-user-data.sh: remove the galaxy_helm_extra_sets extra-vars
  block. The CloudVE chart already auto-wires galaxy_url_prefix from
  ingress.path via double-tpl, so the override is redundant. It also
  risks breaking restore mode: the playbook's set_fact overrides
  galaxy_helm_extra_sets with PVC values for restore, and if Ansible
  extra-vars win precedence, restore mode fails to apply existingClaim.
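
A minimal sketch of dropping the `Authorization` header before forwarding, assuming headers are modelled as name/value pairs. `HeaderFilterSketch` and its contents are hypothetical; the real `HeadersToFilter` in `ProxyService` likely holds other entries and uses Akka HTTP header types.

```scala
import java.util.Locale

// Illustrative only: filters request headers by case-insensitive name.
object HeaderFilterSketch {
  // Hypothetical filter set, stored lowercase; only Authorization is from this commit.
  val HeadersToFilter: Set[String] = Set("authorization")

  def filterHeaders(headers: List[(String, String)]): List[(String, String)] =
    headers.filterNot { case (name, _) =>
      HeadersToFilter.contains(name.toLowerCase(Locale.ROOT))
    }
}
```

HTTP header names are case-insensitive, so the comparison lowercases the name before the lookup; otherwise `authorization: Bearer ...` would slip through to the backend.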

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… AppRestore.Other for Galaxy

When a VM-based Galaxy app is deleted with disks kept, Leo never called
updateLastUsedBy/updateGalaxyDiskRestore so the disk's appRestore was always None.
On re-create, LeoAppServiceInterp threw "no restore info found in DB".

Three coordinated fixes:
- LeoPubsubMessageSubscriber.deleteApp: call updateLastUsedBy for the data disk whenever
  the disk is being kept (msg.diskId is absent); works for both GKE and VM Galaxy.
- PersistentDiskComponent: when formattedBy=Galaxy but galaxyPvcId is null (VM path has
  no Kubernetes PVC), map to AppRestore.Other(lastUsedBy) instead of producing None.
- LeoAppServiceInterp: accept AppRestore.Other for Galaxy in the restore success match so
  VM apps that have AppRestore.Other (rather than GalaxyRestore) can be re-created.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@LizBaldo

Collaborator Author

@afgane I was able to deploy a new Galaxy app, delete it while keeping the disk, and then re-create an app using the same disk 🎉 I'll do one final round of testing once you have the latest image, and then we should be good to go :)

…ore.Other not None

GKEInterpreter.createAndPollApp already calls updateLastUsedBy after installGalaxyVm
(line 431), so the keep-disk delete path does not need to set it separately.

The prior commit's PersistentDiskComponent change now correctly maps a Galaxy disk with
lastUsedBy set but no galaxyPvcId to AppRestore.Other instead of None. That exposed a
fragile test assertion that assumed appRestore would be None and used an unsafe
asInstanceOf[GalaxyRestore] cast. Updated assertion to expect Some(AppRestore.Other(appId)).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>