CTM-397: Replace GKE/Helm Galaxy path with GCE VM-based deployment by LizBaldo · Pull Request #4904 · DataBiosphere/leonardo

LizBaldo · 2026-04-07T17:39:41Z

Summary

Replaces the GKE/Helm-based Galaxy deployment path with a GCE VM-based deployment using galaxy-k8s-boot. Rather than provisioning a full GKE cluster and deploying Galaxy via Helm, Leo now creates a single GCE VM that runs Galaxy via Ansible/microk8s.

What changed

Galaxy VM provisioning (GKEInterpreter.installGalaxyVm)

Creates a GCE VM with a boot disk, data disk, and PostgreSQL disk
Passes galaxy-user-email GCE metadata (from app.auditInfo.creator) so the workspace user becomes the Galaxy admin
Gets or creates a galaxy-batch-runner service account in the user's project
Grants the pet SA roles/batch.jobsEditor at the project level and roles/iam.serviceAccountUser on the Batch SA, so the VM can submit GCP Batch jobs
Creates an NFS firewall rule (leonardo-galaxy-allow-nfs-for-batch) so Batch VMs can reach the Galaxy VM's NFS server (TCP/UDP 2049 and 111)
Polls the instance until it has an external IP; stores it as loadBalancerIp so the Leo proxy can reach the VM across VPC boundaries

Network topology and IP choice

Leo's GKE cluster is in Leo's GCP project; Galaxy VMs are created in the user's workspace project. The two VPCs are not peered, so the VM's internal IP (10.x.x.x) is not routable from Leo's pod
The leonardo-allow-http firewall rule (TCP port 80, source 0.0.0.0/0, targeting VMs with the leonardo network tag) allows Leo to reach the VM's external IP. Galaxy VMs are created with the leonardo tag
Both the readiness health check (isVmReachable) and the Akka HTTP proxy use the external IP

⚠️ Security note: leonardo-allow-http currently allows port 80 from 0.0.0.0/0, meaning the Galaxy VM is reachable directly from the internet, bypassing Leo's workspace-level authorization. This is tracked as a follow-up — options include restricting source ranges to Leo's GKE node CIDR, a GCP service-account-based firewall rule, or a shared-secret header enforced by Galaxy's nginx. See PR discussion for details.

Galaxy VM readiness health check

isProxyAvailable routes through Leo's own proxy hostname; in BEE environments the wildcard DNS resolves to the ingress controller's external IP, unreachable via hairpin NAT from within the GKE pod → TCP timeout
Added AppDAO.isVmReachable(ip, port) — a direct http4s HTTP GET to http://<externalIp>:80/. No proxy hostname resolution required. MockAppDAO returns IO.pure(isUp) for tests

Leo proxy: HTTP support for Galaxy VM backends

Added useHttp: Boolean = false to HostReady — when true the proxy connects via plain HTTP port 80 (ws:// for WebSocket) instead of HTTPS port 443
KubernetesDnsCache sets useHttp = true for AppType.Galaxy apps and maps the fake proxy hostname to the VM's external IP
ProxyService.handleHttpRequest / handleWebSocketRequest branch on useHttp; all non-Galaxy backends are unchanged
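
The scheme and port selection described above can be sketched as a pure function. This is a minimal illustration, not Leo's actual code: `HostReady` here is a simplified stand-in that carries only a hostname and the `useHttp` flag, the helper names `httpBase` and `webSocketBase` are hypothetical, and `wss://` for the HTTPS case is an assumption (the description only names `ws://` for the plain-HTTP case).

```scala
object GalaxyProxyTarget {
  // Simplified stand-in for Leo's HostReady; the real class has other fields.
  final case class HostReady(hostname: String, useHttp: Boolean = false)

  // Base URI for ordinary HTTP requests: plain HTTP on port 80 when useHttp
  // is set, HTTPS on port 443 otherwise.
  def httpBase(host: HostReady): String =
    if (host.useHttp) s"http://${host.hostname}:80"
    else s"https://${host.hostname}:443"

  // Base URI for WebSocket upgrades. "wss" for the HTTPS case is an assumption.
  def webSocketBase(host: HostReady): String =
    if (host.useHttp) s"ws://${host.hostname}:80"
    else s"wss://${host.hostname}:443"
}
```

Because `useHttp` defaults to `false`, any backend that does not set it keeps the HTTPS behaviour, which matches the statement that non-Galaxy backends are unchanged.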

Leo proxy: path handling for Galaxy VM (ProxyService.proxyAppRequest)

Leo forwards the full path (e.g. /proxy/google/v1/apps/{project}/{app}/galaxy/...) to the Galaxy VM unchanged
galaxy-k8s-boot's ansible playbook configures Galaxy's nginx ingress.path to the value of galaxy_prefix, which Leo passes as a GCE metadata item (galaxy-url-prefix). Galaxy's nginx therefore serves at the full Leo proxy path, so all requests route correctly without any path rewriting in Leo
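
As a rough sketch of how the prefix and the metadata items fit together: the path template and the metadata keys `galaxy-url-prefix` and `galaxy-user-email` come from this description, while the object and function names below are hypothetical and the real `installGalaxyVm` passes more metadata than shown here.

```scala
object GalaxyVmMetadata {
  // Full Leo proxy path under which Galaxy is served, following the
  // /proxy/google/v1/apps/{project}/{app}/galaxy template.
  def galaxyPrefix(project: String, appName: String): String =
    s"/proxy/google/v1/apps/$project/$appName/galaxy"

  // Subset of the GCE metadata items named in the PR description.
  def metadata(project: String, appName: String, creatorEmail: String): Map[String, String] =
    Map(
      "galaxy-url-prefix" -> galaxyPrefix(project, appName),
      "galaxy-user-email" -> creatorEmail
    )
}
```

Since nginx on the VM serves at the same prefix Leo receives, the proxy can forward the request path verbatim.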

NFS PVC size: GB → GiB conversion fix (GKEInterpreter.installGalaxyVm)

pvSizeGi = nfsDisk.size.gb - 11 treated decimal GB as binary GiB. For a 500 GB disk: the disk holds ~466 GiB but Leo requested 489 GiB → NFS provisioner fails with insufficient available space, leaving all Galaxy pods Pending
Fixed to convert first: diskSizeGiB = (nfsDisk.size.gb.toLong * 1000^3) / 1024^3, then subtract 11 GiB overhead
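
The arithmetic can be checked with a small sketch. The names `NfsSizing`, `gbToGiB`, `oldPvSizeGi` and `pvSizeGi` are illustrative only; the 11 GiB overhead and the 500 GB example are taken from the description above, and integer division is assumed to round down.

```scala
object NfsSizing {
  private val BytesPerGB: Long  = 1000L * 1000L * 1000L // decimal gigabyte
  private val BytesPerGiB: Long = 1024L * 1024L * 1024L // binary gibibyte
  val OverheadGiB: Long = 11L

  // Decimal GB to binary GiB, rounded down by integer division.
  def gbToGiB(sizeGb: Int): Long = (sizeGb.toLong * BytesPerGB) / BytesPerGiB

  // Buggy version: subtracts the overhead from the GB figure as if it were GiB.
  def oldPvSizeGi(sizeGb: Int): Long = sizeGb.toLong - OverheadGiB

  // Fixed version: convert first, then subtract the overhead.
  def pvSizeGi(sizeGb: Int): Long = gbToGiB(sizeGb) - OverheadGiB
}
```

For a 500 GB disk, `gbToGiB(500)` is 465 (465.66 rounded down), so the old request of 489 GiB exceeds what the disk holds, while the fixed request of 454 GiB fits.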

Lifecycle: restore from existing disks

restore = msg.appType == AppType.Galaxy && msg.createDisk.isEmpty: when a Galaxy app is created without a new disk, the disks already exist (prior app was deleted keeping disks)
In restore mode: skips creating the PostgreSQL disk; passes restore_galaxy=true metadata to Ansible
CreateAppParams.restore: Boolean propagates the flag through to installGalaxyVm
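
A minimal sketch of the restore decision, using simplified stand-ins for Leo's message and app types (the real `CreateAppMessage` and `AppType` carry far more; `isRestore` and `restoreMetadata` are hypothetical helper names):

```scala
object GalaxyRestoreMode {
  sealed trait AppType
  object AppType {
    case object Galaxy extends AppType
    case object Other  extends AppType // placeholder for every non-Galaxy app type
  }

  // Stand-in for the create message: createDisk is defined only when a new
  // disk must be created for the app.
  final case class CreateAppMessage(appType: AppType, createDisk: Option[String])

  // Restore mode: a Galaxy app created without a new disk reuses existing disks.
  def isRestore(msg: CreateAppMessage): Boolean =
    msg.appType == AppType.Galaxy && msg.createDisk.isEmpty

  // Extra metadata handed to Ansible in restore mode.
  def restoreMetadata(restore: Boolean): Map[String, String] =
    if (restore) Map("restore_galaxy" -> "true") else Map.empty
}
```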

Lifecycle: delete keeping disks

DeleteAppMessage(diskId = None) → VM is deleted, both disks are preserved
DeleteAppMessage(diskId = Some(...)) → VM + both disks deleted
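
The two delete cases amount to a pattern match on the optional disk id. The sketch below uses a stripped-down `DeleteAppMessage` and a hypothetical `DeletePlan` result type purely for illustration:

```scala
object GalaxyDelete {
  // Stand-in for the delete message; only the field relevant here is modelled.
  final case class DeleteAppMessage(diskId: Option[Long])

  // Hypothetical summary of what the delete will remove.
  final case class DeletePlan(deleteVm: Boolean, deleteDisks: Boolean)

  def plan(msg: DeleteAppMessage): DeletePlan = msg.diskId match {
    case None    => DeletePlan(deleteVm = true, deleteDisks = false) // keep both disks
    case Some(_) => DeletePlan(deleteVm = true, deleteDisks = true)  // remove VM and both disks
  }
}
```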

Config cleanup

Removed gcpBatchServiceAccountEmail from reference.conf / GalaxyVmConfig / Config.scala — Leo now creates the SA dynamically via getOrCreateServiceAccount instead of relying on a pre-configured email

Architecture notes

Property	GKE Galaxy (before)	GCE VM Galaxy (now)
Backend	GKE cluster + Helm	Single GCE VM (galaxy-k8s-boot `anvil` branch)
VM bootstrap	N/A	GCE `user-data` cloud-init via guest agent
Proxy backend protocol	HTTPS port 443 (nginx ingress TLS)	HTTP port 80 (nginx reverse proxy)
Backend IP	Ingress load balancer IP (external)	VM external IP
Proxy reachability	Via GKE LoadBalancer service IP	Via `leonardo-allow-http` firewall (0.0.0.0/0 → port 80, `leonardo` tag)
Readiness check	`isProxyAvailable` via proxy hostname	`isVmReachable` direct HTTP to external IP
Batch jobs	N/A	GCP Batch via `galaxy-batch-runner` SA

Security comparison: old GKE-based vs. new VM-based

Property	Old (GKE-based)	New (VM-based)
Protocol	HTTPS (mTLS)	Plain HTTP
Target IP	GKE load balancer (external)	VM external IP
Port exposed	443	80
Firewall source range	`0.0.0.0/0`	`0.0.0.0/0`
Certificate validation	Yes — Leo-issued cert on nginx	None

Regressions introduced:

No encryption — traffic between Leo's pod and the Galaxy VM crosses the public internet in plaintext
Direct VM access — port 80 is open to 0.0.0.0/0, so anyone who discovers the VM's external IP can reach Galaxy directly, bypassing Leo's authentication

Alternatives for follow-up:

Option A — VPC peering / Private Service Connect (recommended structural fix): peer Leo's project VPC with the user's workspace VPC so Leo can reach the VM on its internal IP; the 0.0.0.0/0 firewall rule is no longer needed
Option B — HTTPS on the Galaxy VM (closest to old security posture): configure nginx on the Galaxy VM with Leo's CA certificate (as Jupyter VMs do); flip useHttp = false for Galaxy; requires provisioning Leo certs onto the VM during installGalaxyVm
Option C — GCP IAP or Cloud Armor (lower-effort mitigation): restricts direct VM access without VPC changes, but does not encrypt the Leo→VM leg

Test plan

Unit tests pass (GKEInterpreterSpec, LeoPubsubMessageSubscriberSpec)
Scala formatting clean
BEE: create Galaxy app → VM boots, Ansible runs, status goes to Running
BEE: access Galaxy through Leo proxy URL
BEE: verify workspace user email is the Galaxy admin (not a hardcoded address)
BEE: delete app keeping disks → VM deleted, disks remain
BEE: re-create app from existing disks → restore_galaxy=true passed to Ansible, Galaxy restores state
Follow-up: restrict leonardo-allow-http source range from 0.0.0.0/0 to Leo's GKE node CIDR

At Galaxy VM creation, grant the pet SA roles/batch.jobsEditor on the user's project so it can submit and monitor GCP Batch jobs. When the Batch SA lives in the same project, also grant serviceAccountUser on it; cross-project Batch SAs must have that binding configured externally. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

codecov · 2026-04-08T14:43:28Z

Codecov Report

❌ Patch coverage is 80.41237% with 38 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.86%. Comparing base (5b339ea) to head (8fb089b).
⚠️ Report is 12 commits behind head on develop.

Files with missing lines	Patch %	Lines
.../dsde/workbench/leonardo/util/GKEInterpreter.scala	85.43%	22 Missing ⚠️
...itute/dsde/workbench/leonardo/dao/HttpAppDAO.scala	0.00%	9 Missing ⚠️
...de/workbench/leonardo/dns/KubernetesDnsCache.scala	0.00%	3 Missing ⚠️
...e/dsde/workbench/leonardo/dao/HttpJupyterDAO.scala	0.00%	2 Missing ⚠️
...workbench/leonardo/http/service/ProxyService.scala	80.00%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #4904      +/-   ##
===========================================
- Coverage    74.08%   73.86%   -0.22%     
===========================================
  Files          131      131              
  Lines        11100    11166      +66     
  Branches       895      901       +6     
===========================================
+ Hits          8223     8248      +25     
- Misses        2877     2918      +41

Files with missing lines	Coverage Δ
...titute/dsde/workbench/leonardo/config/Config.scala	`97.79% <100.00%> (+0.03%)`	⬆️
...orkbench/leonardo/config/KubernetesAppConfig.scala	`95.00% <ø> (ø)`
...stitute/dsde/workbench/leonardo/dao/ProxyDAO.scala	`25.00% <ø> (ø)`
...orkbench/leonardo/db/PersistentDiskComponent.scala	`97.29% <100.00%> (-2.02%)`	⬇️
...ch/leonardo/http/service/LeoAppServiceInterp.scala	`84.34% <ø> (ø)`
.../leonardo/monitor/LeoPubsubMessageSubscriber.scala	`76.80% <100.00%> (-0.20%)`	⬇️
...workbench/leonardo/util/BuildHelmChartValues.scala	`97.87% <ø> (-0.47%)`	⬇️
...tute/dsde/workbench/leonardo/util/GKEAlgebra.scala	`80.00% <ø> (-20.00%)`	⬇️
...e/dsde/workbench/leonardo/dao/HttpJupyterDAO.scala	`19.44% <0.00%> (-0.56%)`	⬇️
...workbench/leonardo/http/service/ProxyService.scala	`77.00% <80.00%> (-0.56%)`	⬇️
... and 3 more

... and 2 files with indirect coverage changes

Continue to review full report in Codecov by Harness.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b339ea...8fb089b. Read the comment docs.


LizBaldo · 2026-04-14T12:37:07Z

I am currently blocked from testing further because of a lack of permission on the Galaxy image used to boot the VM:
Required 'compute.images.useReadOnly' permission for 'projects/anvil-and-terra-development/global/images/galaxy-k8s-boot-v2026-02-25'

I think I would need the Galaxy team to make the image readable by allAuthenticatedUsers in the anvil-and-terra-development project

LizBaldo · 2026-04-14T14:23:33Z

> I am currently blocked from testing further because of a lack of permission on the galaxy image to use on the boot VM: Required 'compute.images.useReadOnly' permission for 'projects/anvil-and-terra-development/global/images/galaxy-k8s-boot-v2026-02-25'
>
> I think I would need the Galaxy team to make the image allAuthenticatedUsers in the anvil-and-terra-development project

The Galaxy team made the image public so I am currently unblocked :)

aednichols · 2026-04-14T14:47:25Z

the Galaxy VM is reachable directly from the internet

I'm not sure this is a concern, don't the instances live in a VPC with NAT?

LizBaldo · 2026-04-14T14:56:51Z

the Galaxy VM is reachable directly from the internet

I'm not sure this is a concern, don't the instances live in a VPC with NAT?

I agree, but this is a departure from how we used to handle it. I am not sure the compliance review covered this, so I want to triple-check before merging.

…, gcpBatchSaProject parsing

- Pass app.auditInfo.creator as galaxy-user-email GCE metadata so the actual workspace user (not dev@galaxyproject.org) becomes the Galaxy admin
- Fix scala.io.Source resource leak in installGalaxyVm using scala.util.Using
- Update sourceImage to galaxy-k8s-boot-v2026-06-10 and gitBranch to "anvil"
- Fix HOST_IP to use GCE metadata server instead of external ifconfig.me
- Fix gcpBatchSaProject SA email parsing: lift(1) + stripSuffix instead of lastOption + replace to avoid matching suffix in unexpected positions
- Correct stale comments: galaxy_url_prefix → galaxy_prefix, dev → anvil branch, wrong "internal IP" comment corrected to "external IP"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
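
To illustrate the SA email parsing change, here is a hedged sketch of both approaches. It assumes the usual service-account email shape `name@project.iam.gserviceaccount.com`; the object and function names are hypothetical and the actual Leo code may differ in detail.

```scala
object BatchSaProject {
  private val Suffix = ".iam.gserviceaccount.com"

  // Older approach as described: take the last "@"-separated segment and
  // remove the suffix wherever it occurs in that segment.
  def projectOld(email: String): Option[String] =
    email.split("@").lastOption.map(_.replace(Suffix, ""))

  // Newer approach: require a segment after the "@" and strip the suffix
  // only from the end of it.
  def project(email: String): Option[String] =
    email.split("@").lift(1).map(_.stripSuffix(Suffix))
}
```

For a well-formed email both return the project id. For a string without `@`, `lastOption` still yields the whole input while `lift(1)` yields `None`, so malformed input no longer passes through as if it were a project name.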

The pre-baked galaxy-k8s-boot image carries cloud-init state from its build, so cloud-init treats new VM launches as subsequent boots and skips runcmd. This causes "No startup scripts to run" in the Guest Agent logs and a VM that never bootstraps Galaxy.

Fix: pass the bootstrap script as the "startup-script" metadata key instead of "user-data". The GCE Guest Agent always executes startup-script on boot, regardless of cloud-init state.

galaxy-user-data.sh is reformatted from cloud-config YAML to a plain bash script. The sudo -u debian block now uses a single-quoted heredoc delimiter (<<'DEBIAN_EOF') to avoid apostrophes in comments breaking shell quoting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…n playbook.yml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ty with anvil playbook (helm list --all) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…th check

Two bugs found during BEE testing:

1. Blank page on Galaxy load: the anvil playbook's galaxy_prefix only configures the nginx ingress path, not Galaxy's own galaxy_url_prefix in galaxy.yml. Without it Galaxy generates /static/... links that resolve without the Leo proxy prefix, leaving a blank page. Fixed by passing galaxy_helm_extra_sets via an extra-vars YAML file so Helm also sets configs.galaxy\.yml.galaxy_url_prefix.
2. App marked Running too early: isVmReachable was polling GET /, which returns 404 from nginx (no ingress rule at root). Because 404 < 500 evaluates to true, Leo marked the app Running before Galaxy pods were ready. Changed to poll the galaxy prefix path and require status < 400 so 404 (no ingress yet) and 502 (pods starting) both keep polling.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
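
The difference between the two status thresholds can be sketched as follows. This is an illustration under assumptions: Leo's real check uses http4s, whereas the `probe` helper below swaps in the JDK's `java.net.http.HttpClient` to stay dependency-free, and all names in the block are hypothetical.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.time.Duration
import scala.util.Try

object GalaxyReadiness {
  // Earlier predicate: any status below 500 counted as "up", so a 404 passed.
  def looseIsUp(status: Int): Boolean = status < 500

  // Stricter predicate: only statuses below 400 count as "up".
  def strictIsUp(status: Int): Boolean = status < 400

  // One probe of http://<externalIp>:80<path>. Redirects are not followed,
  // and network errors or timeouts count as "not ready".
  def probe(externalIp: String, path: String): Boolean = {
    val client = HttpClient.newBuilder()
      .connectTimeout(Duration.ofSeconds(5))
      .followRedirects(HttpClient.Redirect.NEVER)
      .build()
    Try {
      val request = HttpRequest.newBuilder(URI.create(s"http://$externalIp:80$path"))
        .timeout(Duration.ofSeconds(5))
        .GET()
        .build()
      client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode()
    }.toOption.exists(strictIsUp)
  }
}
```

Note that a 301 still satisfies `status < 400`, so the strict predicate is only as reliable as the path being polled.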

…is literal

YAML double-quoted strings treat \ as escape, making \.yml invalid. Single-quoted YAML scalars treat backslash as literal, which is what Helm needs for the dotted key configs.galaxy\.yml.galaxy_url_prefix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The cloudve/galaxy chart stores Galaxy config under configs.galaxy.yml.galaxy.* so the correct key is configs.galaxy\.yml.galaxy.galaxy_url_prefix, not configs.galaxy\.yml.galaxy_url_prefix. The previous key wrote the value at the wrong level of galaxy.yml where Galaxy doesn't read it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ss probe

Three changes:

- ProxyService: add Authorization to HeadersToFilter so Leo no longer forwards the Terra Bearer JWT to backends. Galaxy 23.1+ treats any Authorization: Bearer value as a Galaxy API key; the Terra JWT fails validation and every API call returns 400, causing a blank page. Leo is the auth boundary, so backends should not receive user tokens.
- GKEInterpreter: poll /api/version instead of the bare galaxy prefix path for the VM readiness check. The bare path can return a 301 redirect (nginx trailing-slash normalisation) before Galaxy's Python backend is ready, satisfying < 400 and marking Running too early.
- galaxy-user-data.sh: remove the galaxy_helm_extra_sets extra-vars block. The CloudVE chart already auto-wires galaxy_url_prefix from ingress.path via double-tpl, so the override is redundant. It also risks breaking restore mode: the playbook's set_fact overrides galaxy_helm_extra_sets with PVC values for restore, and if Ansible extra-vars win precedence, restore mode fails to apply existingClaim.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
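
A minimal sketch of the header filtering idea, assuming headers are modelled as name/value pairs. The set below contains only `authorization` for illustration; the real `HeadersToFilter` in ProxyService holds other entries that are not reproduced here, and `stripFiltered` is a hypothetical name.

```scala
import java.util.Locale

object ProxyHeaderFilter {
  // Illustrative contents only; stored lower-case for case-insensitive matching.
  val HeadersToFilter: Set[String] = Set("authorization")

  // Drop every header whose name is in the filter set, ignoring case,
  // and keep the remaining headers in their original order.
  def stripFiltered(headers: Seq[(String, String)]): Seq[(String, String)] =
    headers.filterNot { case (name, _) =>
      HeadersToFilter.contains(name.toLowerCase(Locale.ROOT))
    }
}
```

With this in place a request carrying `Authorization: Bearer <jwt>` reaches the backend without the token, while unrelated headers pass through untouched.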

… AppRestore.Other for Galaxy

When a VM-based Galaxy app is deleted with disks kept, Leo never called updateLastUsedBy/updateGalaxyDiskRestore so the disk's appRestore was always None. On re-create, LeoAppServiceInterp threw "no restore info found in DB".

Three coordinated fixes:

- LeoPubsubMessageSubscriber.deleteApp: call updateLastUsedBy for the data disk whenever the disk is being kept (msg.diskId is absent); works for both GKE and VM Galaxy.
- PersistentDiskComponent: when formattedBy=Galaxy but galaxyPvcId is null (VM path has no Kubernetes PVC), map to AppRestore.Other(lastUsedBy) instead of producing None.
- LeoAppServiceInterp: accept AppRestore.Other for Galaxy in the restore success match so VM apps that have AppRestore.Other (rather than GalaxyRestore) can be re-created.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
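
The PersistentDiskComponent mapping for Galaxy-formatted disks might look roughly like the sketch below. The `AppRestore` ADT is a simplified stand-in (field names and types are assumptions), and the function only covers disks already known to be formatted by Galaxy.

```scala
object GalaxyDiskRestore {
  sealed trait AppRestore
  object AppRestore {
    // GKE path: the disk is tied to a Kubernetes PVC.
    final case class GalaxyRestore(galaxyPvcId: String, lastUsedBy: Long) extends AppRestore
    // VM path: no PVC exists, only the id of the app that last used the disk.
    final case class Other(lastUsedBy: Long) extends AppRestore
  }

  // Restore info for a disk with formattedBy = Galaxy.
  def restoreFor(galaxyPvcId: Option[String], lastUsedBy: Option[Long]): Option[AppRestore] =
    (galaxyPvcId, lastUsedBy) match {
      case (Some(pvc), Some(app)) => Some(AppRestore.GalaxyRestore(pvc, app))
      case (None, Some(app))      => Some(AppRestore.Other(app)) // previously None
      case _                      => None // never used by an app, nothing to restore
    }
}
```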

LizBaldo · 2026-06-16T20:50:46Z

@afgane I was able to deploy a new Galaxy app, then delete it while keeping the disk and then recreating a new app using the same disk 🎉 I'll do one final round of testing once you have the latest image and then we should be good to go :)

…ore.Other not None

GKEInterpreter.createAndPollApp already calls updateLastUsedBy after installGalaxyVm (line 431), so the keep-disk delete path does not need to set it separately.

The prior commit's PersistentDiskComponent change now correctly maps a Galaxy disk with lastUsedBy set but no galaxyPvcId to AppRestore.Other instead of None. That exposed a fragile test assertion that assumed appRestore would be None and used an unsafe asInstanceOf[GalaxyRestore] cast. Updated assertion to expect Some(AppRestore.Other(appId)).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace GKE/Helm Galaxy path with GCE VM-based deployment

0e070c0

LizBaldo requested a review from a team as a code owner April 7, 2026 17:39

Liz Baldo and others added 4 commits April 7, 2026 13:51

fix format

8b9dffd

add SA permissions and fix unit tests

d48a60f

fix the last two unit tests

8bf2432

LizBaldo requested a review from lucymcnatt April 8, 2026 15:58

Liz Baldo added 3 commits April 10, 2026 11:17

handle Batch SA creation and fix restore Galaxy

8196bea

fix compiler issues

95fa6b8

Fix Leo proxy for Galaxy VM: use HTTP on port 80 via VM internal IP

86a4966

LizBaldo requested a review from afgane April 13, 2026 13:00

Liz Baldo added 3 commits April 13, 2026 09:39

Fix Galaxy VM readiness check: use direct HTTP to VM internal IP

bea6e9d

Fix Galaxy VM proxy: use external IP instead of internal IP

c28fcca

Fix Galaxy VM bootstrap: use galaxy-k8s-boot image

d8683d8

increase boot disk size to Galaxy VM disk image reqs

66135f3

Liz Baldo added 4 commits April 14, 2026 16:34

mark cluster as deleted and wait for iam to propagate

98ec831

fix Gb to Gib conversion

b350284

strip leo proxy prefix

25c6605

pass the galaxy_url_prefix

2739d88

afgane mentioned this pull request Apr 15, 2026

Add support for serving Galaxy at a proxy prefix galaxyproject/galaxy-k8s-boot#63

Merged

Liz Baldo and others added 5 commits May 8, 2026 13:21

use new proxy path env variable

a98b353

Fix ansible inventory group name: [vm] -> [vms] to match hosts: vms i…

aa73c84

…n playbook.yml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Revert sourceImage to v2026-02-25: v2026-06-10 has Helm incompatibili…

8823335

…ty with anvil playbook (helm list --all) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Liz Baldo and others added 6 commits June 15, 2026 11:18

format fix

d8b4372

LizBaldo wants to merge 28 commits into develop from CTM-397-deploy-galaxy-on-GCE
