CTM-397: Replace GKE/Helm Galaxy path with GCE VM-based deployment #4904
LizBaldo wants to merge 28 commits
At Galaxy VM creation, grant the pet SA roles/batch.jobsEditor on the user's project so it can submit and monitor GCP Batch jobs. When the Batch SA lives in the same project, also grant serviceAccountUser on it; cross-project Batch SAs must have that binding configured externally. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codecov Report (❌ patch coverage)
@@ Coverage Diff @@
## develop #4904 +/- ##
===========================================
- Coverage 74.08% 73.86% -0.22%
===========================================
Files 131 131
Lines 11100 11166 +66
Branches 895 901 +6
===========================================
+ Hits 8223 8248 +25
- Misses 2877 2918 +41
... and 2 files with indirect coverage changes
I am currently blocked from testing further because I lack permission to use the Galaxy image for the boot VM. I think I would need the Galaxy team to make the image available to allAuthenticatedUsers in the anvil-and-terra-development project.
The Galaxy team made the image public, so I am currently unblocked :)
I'm not sure this is a concern. Don't the instances live in a VPC with NAT?
I agree, but this is a departure from how we used to handle it. I am not sure that the compliance review covered this, so I want to triple-check before merging.
…, gcpBatchSaProject parsing
- Pass app.auditInfo.creator as galaxy-user-email GCE metadata so the actual workspace user (not dev@galaxyproject.org) becomes the Galaxy admin
- Fix scala.io.Source resource leak in installGalaxyVm using scala.util.Using
- Update sourceImage to galaxy-k8s-boot-v2026-06-10 and gitBranch to "anvil"
- Fix HOST_IP to use the GCE metadata server instead of the external ifconfig.me
- Fix gcpBatchSaProject SA email parsing: lift(1) + stripSuffix instead of lastOption + replace, to avoid matching the suffix in unexpected positions
- Correct stale comments: galaxy_url_prefix → galaxy_prefix, dev → anvil branch, wrong "internal IP" comment corrected to "external IP"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
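The difference between the two parsing strategies named in the last-but-one bullet can be sketched as follows. This is an illustrative Python analogue, not the Scala in the PR: the function names and sample inputs are hypothetical, and it assumes only the usual service-account email shape `name@PROJECT.iam.gserviceaccount.com`.

```python
from typing import Optional

SA_SUFFIX = ".iam.gserviceaccount.com"


def project_fragile(email: str) -> str:
    """Rough analogue of lastOption + replace.

    Takes the last '@'-separated segment (the whole string if there is no '@')
    and removes the suffix wherever it occurs, not only at the end.
    """
    return email.split("@")[-1].replace(SA_SUFFIX, "")


def project_strict(email: str) -> Optional[str]:
    """Rough analogue of lift(1) + stripSuffix.

    Requires a segment after the first '@' and strips the suffix only when it
    is a true suffix of that segment.
    """
    parts = email.split("@")
    if len(parts) < 2:
        return None  # no domain part, so no project can be derived
    return parts[1].removesuffix(SA_SUFFIX)  # Python 3.9+
```

With a well-formed email both functions agree; they diverge on malformed input, where the strict variant reports the absence of a project instead of returning the input unchanged.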
The pre-baked galaxy-k8s-boot image carries cloud-init state from its build, so cloud-init treats new VM launches as subsequent boots and skips runcmd — causing "No startup scripts to run" in Guest Agent logs and a VM that never bootstraps Galaxy. Fix: pass the bootstrap script as the "startup-script" metadata key instead of "user-data". The GCE Guest Agent always executes startup-script on boot, regardless of cloud-init state. galaxy-user-data.sh is reformatted from cloud-config YAML to a plain bash script. The sudo -u debian block now uses a single-quoted heredoc delimiter (<<'DEBIAN_EOF') to avoid apostrophes in comments breaking shell quoting. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n playbook.yml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ty with anvil playbook (helm list --all) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…th check
Two bugs found during BEE testing:
1. Blank page on Galaxy load: the anvil playbook's galaxy_prefix only configures the nginx ingress path, not Galaxy's own galaxy_url_prefix in galaxy.yml. Without it Galaxy generates /static/... links that resolve without the Leo proxy prefix, leaving a blank page. Fixed by passing galaxy_helm_extra_sets via an extra-vars YAML file so Helm also sets configs.galaxy\.yml.galaxy_url_prefix.
2. App marked Running too early: isVmReachable was polling GET /, which returns 404 from nginx (no ingress rule at root). Because 404 < 500 evaluates to true, Leo marked the app Running before the Galaxy pods were ready. Changed to poll the galaxy prefix path and require status < 400, so 404 (no ingress yet) and 502 (pods starting) both keep polling.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
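The second bug comes down to the status threshold used by the readiness predicate. A minimal Python sketch of the two predicates and a polling loop is shown here; the function names and the injected `probe` callable are hypothetical stand-ins, not Leo's actual isVmReachable implementation.

```python
from typing import Callable


def is_ready_old(status: int) -> bool:
    # Old check: any status below 500 counted as ready, so nginx's 404 passed.
    return status < 500


def is_ready_new(status: int) -> bool:
    # New check: only statuses below 400 count, so 404 and 502 keep polling.
    return status < 400


def poll_until_ready(probe: Callable[[], int], max_attempts: int) -> bool:
    """Call probe() up to max_attempts times; stop at the first ready status."""
    for _ in range(max_attempts):
        if is_ready_new(probe()):
            return True
    return False
```

Injecting the probe keeps the sketch free of network calls; a real check would issue the HTTP GET inside `probe` and sleep between attempts.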
…is literal
YAML double-quoted strings treat \ as an escape character, making \.yml invalid. Single-quoted YAML scalars treat the backslash as a literal, which is what Helm needs for the dotted key configs.galaxy\.yml.galaxy_url_prefix.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The cloudve/galaxy chart stores Galaxy config under configs.galaxy.yml.galaxy.* so the correct key is configs.galaxy\.yml.galaxy.galaxy_url_prefix, not configs.galaxy\.yml.galaxy_url_prefix. The previous key wrote the value at the wrong level of galaxy.yml where Galaxy doesn't read it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
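To see why the two keys land at different levels, it can help to split them the way a Helm --set key path is interpreted: an unescaped dot separates nesting levels, and a backslash-escaped dot stays inside the key name. The sketch below is a simplified Python model of that rule only (it ignores list indices, commas, and other --set syntax), and `split_set_key` is a hypothetical helper, not part of Helm.

```python
from typing import List


def split_set_key(key: str) -> List[str]:
    """Split a dotted key path on unescaped dots; '\\.' yields a literal dot."""
    parts: List[str] = []
    current: List[str] = []
    i = 0
    while i < len(key):
        ch = key[i]
        if ch == "\\" and i + 1 < len(key):
            current.append(key[i + 1])  # keep the escaped character literally
            i += 2
        elif ch == ".":
            parts.append("".join(current))  # unescaped dot ends this level
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts
```

Under this model the corrected key has four levels, with galaxy_url_prefix nested under the galaxy mapping inside galaxy.yml, while the earlier key has three and places the value directly under galaxy.yml.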
…ss probe
Three changes:
- ProxyService: add Authorization to HeadersToFilter so Leo no longer forwards the Terra Bearer JWT to backends. Galaxy 23.1+ treats any Authorization: Bearer value as a Galaxy API key; the Terra JWT fails validation and every API call returns 400, causing a blank page. Leo is the auth boundary, so backends should not receive user tokens.
- GKEInterpreter: poll /api/version instead of the bare galaxy prefix path for the VM readiness check. The bare path can return a 301 redirect (nginx trailing-slash normalisation) before Galaxy's Python backend is ready, satisfying < 400 and marking the app Running too early.
- galaxy-user-data.sh: remove the galaxy_helm_extra_sets extra-vars block. The CloudVE chart already auto-wires galaxy_url_prefix from ingress.path via double-tpl, so the override is redundant. It also risks breaking restore mode: the playbook's set_fact overrides galaxy_helm_extra_sets with PVC values for restore, and if Ansible extra-vars win precedence, restore mode fails to apply existingClaim.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
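The ProxyService change amounts to dropping a header before the request is forwarded. A minimal Python sketch of that filtering step follows; `HEADERS_TO_FILTER` here is a hypothetical one-entry stand-in for Leo's HeadersToFilter, whose full contents are not shown in this PR text, and the other header values are made up.

```python
from typing import Dict

# Hypothetical stand-in: only the header this commit adds to the filter list.
HEADERS_TO_FILTER = {"authorization"}


def filter_headers(headers: Dict[str, str]) -> Dict[str, str]:
    """Return the headers to forward, dropping filtered names case-insensitively."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in HEADERS_TO_FILTER
    }


incoming = {
    "Authorization": "Bearer terra-jwt-placeholder",
    "Accept": "application/json",
}
forwarded = filter_headers(incoming)  # the Authorization entry is not forwarded
```

HTTP header names are case-insensitive, which is why the comparison lowercases the name before the lookup.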
… AppRestore.Other for Galaxy
When a VM-based Galaxy app is deleted with disks kept, Leo never called updateLastUsedBy/updateGalaxyDiskRestore, so the disk's appRestore was always None. On re-create, LeoAppServiceInterp threw "no restore info found in DB". Three coordinated fixes:
- LeoPubsubMessageSubscriber.deleteApp: call updateLastUsedBy for the data disk whenever the disk is being kept (msg.diskId is absent); works for both GKE and VM Galaxy.
- PersistentDiskComponent: when formattedBy=Galaxy but galaxyPvcId is null (the VM path has no Kubernetes PVC), map to AppRestore.Other(lastUsedBy) instead of producing None.
- LeoAppServiceInterp: accept AppRestore.Other for Galaxy in the restore success match so VM apps that have AppRestore.Other (rather than GalaxyRestore) can be re-created.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@afgane I was able to deploy a new Galaxy app, delete it while keeping the disk, and then recreate a new app using the same disk 🎉 I'll do one final round of testing once you have the latest image, and then we should be good to go :)
…ore.Other not None
GKEInterpreter.createAndPollApp already calls updateLastUsedBy after installGalaxyVm (line 431), so the keep-disk delete path does not need to set it separately. The prior commit's PersistentDiskComponent change now correctly maps a Galaxy disk with lastUsedBy set but no galaxyPvcId to AppRestore.Other instead of None. That exposed a fragile test assertion that assumed appRestore would be None and used an unsafe asInstanceOf[GalaxyRestore] cast. Updated the assertion to expect Some(AppRestore.Other(appId)).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
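The mapping described above for Galaxy-formatted disks can be sketched as a small decision function. This is a Python illustration under stated assumptions, not the Scala in PersistentDiskComponent: the class and field names are hypothetical, and the case where lastUsedBy is unset is assumed to yield no restore info.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class GalaxyRestore:
    galaxy_pvc_id: str
    last_used_by: int


@dataclass(frozen=True)
class Other:
    last_used_by: int


def galaxy_app_restore(
    galaxy_pvc_id: Optional[str], last_used_by: Optional[int]
) -> Optional[Union[GalaxyRestore, Other]]:
    """Restore info for a disk whose formattedBy is Galaxy."""
    if last_used_by is None:
        return None  # assumption: no previous app recorded, nothing to restore
    if galaxy_pvc_id is not None:
        return GalaxyRestore(galaxy_pvc_id, last_used_by)  # GKE path with a PVC
    return Other(last_used_by)  # VM path: no Kubernetes PVC exists
```

The updated test assertion corresponds to the last branch: a disk with lastUsedBy set and no PVC id yields `Other(app_id)` rather than `None`.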
Summary
Replaces the GKE/Helm-based Galaxy deployment path with a GCE VM-based deployment using galaxy-k8s-boot. Rather than provisioning a full GKE cluster and deploying Galaxy via Helm, Leo now creates a single GCE VM that runs Galaxy via Ansible/microk8s.
What changed
Galaxy VM provisioning (GKEInterpreter.installGalaxyVm)
- Passes galaxy-user-email GCE metadata (from app.auditInfo.creator) so the workspace user becomes the Galaxy admin
- Creates the galaxy-batch-runner service account in the user's project
- Grants roles/batch.jobsEditor at the project level and roles/iam.serviceAccountUser on the Batch SA, so the VM can submit GCP Batch jobs
- Firewall rule (leonardo-galaxy-allow-nfs-for-batch) so Batch VMs can reach the Galaxy VM's NFS server (TCP/UDP 2049 and 111)
- Sets loadBalancerIp so the Leo proxy can reach the VM across VPC boundaries
Network topology and IP choice
- The internal IP (10.x.x.x) is not routable from Leo's pod
- The leonardo-allow-http firewall rule (TCP port 80, source 0.0.0.0/0, targeting VMs with the leonardo network tag) allows Leo to reach the VM's external IP. Galaxy VMs are created with the leonardo tag
- The health check (isVmReachable) and the Akka HTTP proxy use the external IP
Galaxy VM readiness health check
- isProxyAvailable routes through Leo's own proxy hostname; in BEE environments the wildcard DNS resolves to the ingress controller's external IP, unreachable via hairpin NAT from within the GKE pod → TCP timeout
- AppDAO.isVmReachable(ip, port): a direct http4s HTTP GET to http://<externalIp>:80/. No proxy hostname resolution required.
- MockAppDAO returns IO.pure(isUp) for tests
Leo proxy: HTTP support for Galaxy VM backends
- Adds useHttp: Boolean = false to HostReady; when true the proxy connects via plain HTTP port 80 (ws:// for WebSocket) instead of HTTPS port 443
- KubernetesDnsCache sets useHttp = true for AppType.Galaxy apps and maps the fake proxy hostname to the VM's external IP
- ProxyService.handleHttpRequest / handleWebSocketRequest branch on useHttp; all non-Galaxy backends are unchanged
Leo proxy: path handling for Galaxy VM (ProxyService.proxyAppRequest)
- Forwards the full Leo proxy path (/proxy/google/v1/apps/{project}/{app}/galaxy/...) to the Galaxy VM unchanged
- The playbook sets ingress.path to the value of galaxy_prefix, which Leo passes as a GCE metadata item (galaxy-url-prefix). Galaxy's nginx therefore serves at the full Leo proxy path, so all requests route correctly without any path rewriting in Leo
NFS PVC size: GB → GiB conversion fix (GKEInterpreter.installGalaxyVm)
- Previously, pvSizeGi = nfsDisk.size.gb - 11 treated decimal GB as binary GiB. For a 500 GB disk: the disk holds ~466 GiB but Leo requested 489 GiB → the NFS provisioner fails with insufficient available space, leaving all Galaxy pods Pending
- Now diskSizeGiB = (nfsDisk.size.gb.toLong * 1000^3) / 1024^3, then subtract 11 GiB overhead
Lifecycle: restore from existing disks
- restore = msg.appType == AppType.Galaxy && msg.createDisk.isEmpty: when a Galaxy app is created without a new disk, the disks already exist (prior app was deleted keeping disks)
- Passes restore_galaxy=true metadata to Ansible
- CreateAppParams.restore: Boolean propagates the flag through to installGalaxyVm
Lifecycle: delete keeping disks
- DeleteAppMessage(diskId = None) → VM is deleted, both disks are preserved
- DeleteAppMessage(diskId = Some(...)) → VM + both disks deleted
Config cleanup
- Removes gcpBatchServiceAccountEmail from reference.conf / GalaxyVmConfig / Config.scala: Leo now creates the SA dynamically via getOrCreateServiceAccount instead of relying on a pre-configured email
Architecture notes
- (anvil branch)
- user-data cloud-init via guest agent
- leonardo-allow-http firewall (0.0.0.0/0 → port 80, leonardo tag)
- isProxyAvailable via proxy hostname
- isVmReachable direct HTTP to external IP
- galaxy-batch-runner SA
Security comparison: old GKE-based vs. new VM-based
- 0.0.0.0/0
- 0.0.0.0/0
Regressions introduced:
- The leonardo-allow-http firewall rule admits 0.0.0.0/0, so anyone who discovers the VM's external IP can reach Galaxy directly, bypassing Leo's authentication
Alternatives for follow-up:
- 0.0.0.0/0 firewall rule is no longer needed
- useHttp = false for Galaxy; requires provisioning Leo certs onto the VM during installGalaxyVm
Test plan
- Unit tests (GKEInterpreterSpec, LeoPubsubMessageSubscriberSpec)
- restore_galaxy=true passed to Ansible, Galaxy restores state
- Restrict the leonardo-allow-http source range from 0.0.0.0/0 to Leo's GKE node CIDR
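The arithmetic behind the NFS PVC size fix described under "What changed" can be checked with a short sketch. It is written in Python rather than the PR's Scala; the function names are hypothetical, and integer (floor) division is assumed, matching the Long arithmetic in the quoted formula.

```python
GB = 1000 ** 3   # decimal gigabyte, in bytes
GIB = 1024 ** 3  # binary gibibyte, in bytes
OVERHEAD_GIB = 11


def pv_size_old(disk_gb: int) -> int:
    # Old behaviour: subtracts the overhead from the GB figure as if it were GiB.
    return disk_gb - OVERHEAD_GIB


def pv_size_new(disk_gb: int) -> int:
    # New behaviour: convert decimal GB to whole GiB first, then subtract overhead.
    disk_gib = (disk_gb * GB) // GIB
    return disk_gib - OVERHEAD_GIB


# For a 500 GB disk: 500 * 10^9 bytes is about 465.66 GiB (465 after flooring).
old_request = pv_size_old(500)  # 489, more than the disk can hold
new_request = pv_size_new(500)  # 454, fits within the disk
```

Flooring the conversion errs on the small side, so the requested PVC size never exceeds what the disk provides.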