feat: systemd support #295
📝 Walkthrough

This pull request introduces preliminary support for managing database instances using SystemD orchestration, alongside the existing Docker Swarm support. Changes include a new SystemD orchestrator implementation, updated API schemas, test infrastructure enhancements, development environment tooling using Lima, and corresponding service/workflow updates.
🚥 Pre-merge checks: 2 passed, 1 failed (warning).
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 13
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
server/internal/postgres/hba/entry.go (1)
63-67: ⚠️ Potential issue | 🟠 Major

Avoid CIDR normalization when `Mask` is present.

Line 63 always rewrites bare IPs to CIDR, but lines 65-67 can still append a separate mask. That can emit mixed address formats in one rule. Keep the raw `e.Address` when `e.Mask` is non-empty.

Suggested fix:

```diff
 func (e Entry) String() string {
 	var entry string
 	switch e.Type {
 	case EntryTypeInclude, EntryTypeIncludeIfExists, EntryTypeIncludeDir:
 		return fmt.Sprintf("%-17s %s", e.Type, e.IncludePath)
 	default:
-		entry = fmt.Sprintf("%-7s %-15s %-15s %-23s", e.Type, e.Database, e.User, transformAddress(e.Address))
+		address := e.Address
+		if e.Mask == "" {
+			address = transformAddress(address)
+		}
+		entry = fmt.Sprintf("%-7s %-15s %-15s %-23s", e.Type, e.Database, e.User, address)
 	}
```

Also applies to: 76-93
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/internal/postgres/hba/entry.go` around lines 63 - 67, The code currently always calls transformAddress(e.Address) when building the hba entry (the variable entry) which normalizes bare IPs to CIDR even when e.Mask is set, leading to mixed formats; change the logic in the entry construction to use the raw e.Address (do not call transformAddress) whenever e.Mask != "" so the address and separate mask remain unmodified, and apply the same correction to the other HBA entry construction block that mirrors this behavior (the block covering the subsequent entry building around the transformAddress usage for e.Mask handling).

server/internal/database/spec.go (1)
554-580: ⚠️ Potential issue | 🟠 Major

`InstanceSpec.Clone()` drops `AllHostIDs` and can break downstream systemd config generation.

Line 560+ builds a cloned `InstanceSpec`, but the new field from line 527 is never copied. Any flow that relies on cloned specs will silently lose peer host membership.

💡 Proposed fix:

```diff
 func (s *InstanceSpec) Clone() *InstanceSpec {
 	users := make([]*User, len(s.DatabaseUsers))
 	for i, user := range s.DatabaseUsers {
 		users[i] = user.Clone()
 	}
 	return &InstanceSpec{
 		InstanceID:       s.InstanceID,
 		TenantID:         utils.ClonePointer(s.TenantID),
 		DatabaseID:       s.DatabaseID,
 		HostID:           s.HostID,
 		DatabaseName:     s.DatabaseName,
 		NodeName:         s.NodeName,
 		NodeOrdinal:      s.NodeOrdinal,
 		PgEdgeVersion:    s.PgEdgeVersion.Clone(),
 		Port:             utils.ClonePointer(s.Port),
 		PatroniPort:      utils.ClonePointer(s.PatroniPort),
 		CPUs:             s.CPUs,
 		MemoryBytes:      s.MemoryBytes,
 		DatabaseUsers:    users,
 		BackupConfig:     s.BackupConfig.Clone(),
 		RestoreConfig:    s.RestoreConfig.Clone(),
 		PostgreSQLConf:   maps.Clone(s.PostgreSQLConf),
 		ClusterSize:      s.ClusterSize,
 		OrchestratorOpts: s.OrchestratorOpts.Clone(),
+		AllHostIDs:       slices.Clone(s.AllHostIDs),
 	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/internal/database/spec.go` around lines 554 - 580, InstanceSpec.Clone() currently omits the AllHostIDs field so cloned specs lose peer host membership; update the returned struct in InstanceSpec.Clone to include a cloned copy of s.AllHostIDs (e.g., deep-copy the slice rather than assigning the original) so the new InstanceSpec preserves AllHostIDs alongside the other cloned fields (use the project's slice-clone helper if present or create a new slice and copy elements).

e2e/cancel_task_test.go (1)
48-55: ⚠️ Potential issue | 🟡 Minor

Dead code after `require.NoError`.

The `require.NoError(t, err)` at line 48 will terminate the test if `err != nil`, making the subsequent error check at lines 52-55 unreachable.

🧹 Suggested fix: remove the redundant error handling

```diff
 	})
 	require.NoError(t, err)
-
 	creation_task := create_resp.Task
 	database := create_resp.Database
-	if err != nil {
-		t.Logf("problem creating test db %s", err)
-		return
-	}
 	t.Logf("successfully created cancel task test db")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@e2e/cancel_task_test.go` around lines 48 - 55, The test contains redundant dead-code: after calling require.NoError(t, err) which fails the test on error, the subsequent if err != nil { t.Logf(...); return } block is unreachable; remove that redundant error check and logging (the variables creation_task := create_resp.Task and database := create_resp.Database can remain) so the test relies on require.NoError for failure handling and does not contain the unused conditional that checks err.

server/internal/api/apiv1/validate.go (2)
212-277: ⚠️ Potential issue | 🟠 Major

Validate the effective SystemD port pair, not just node overrides.

Once a node can inherit `db.Port`/`db.PatroniPort`, `validatePorts(node.Port, node.PatroniPort, ...)` is no longer sufficient. A spec like `db.Port=5432` plus `node.PatroniPort=5432` currently passes, even though postgres and patroni would bind the same host port at runtime. Compute the effective postgres/patroni ports first, then run both the conflict check and the required-field check on that pair.

💡 Suggested change:

```diff
-	portPath := appendPath(path, "port")
-	errs = append(errs, validatePorts(node.Port, node.PatroniPort, portPath))
+	portPath := appendPath(path, "port")
+	patroniPortPath := appendPath(path, "patroni_port")
+	effectivePostgresPort := node.Port
+	effectivePatroniPort := node.PatroniPort
+	if orchestrator == config.OrchestratorSystemD {
+		if effectivePostgresPort == nil {
+			effectivePostgresPort = db.Port
+		}
+		if effectivePatroniPort == nil {
+			effectivePatroniPort = db.PatroniPort
+		}
+	}
+	errs = append(errs, validatePorts(effectivePostgresPort, effectivePatroniPort, portPath))
@@
 	switch orchestrator {
 	case config.OrchestratorSystemD:
-		if db.Port == nil && node.Port == nil {
-			portPath := appendPath(path, "port")
-			errs = append(errs, newValidationError(errors.New("port must be defined"), portPath))
+		if effectivePostgresPort == nil {
+			errs = append(errs, newValidationError(errors.New("port must be defined"), portPath))
 		}
-		if db.PatroniPort == nil && node.PatroniPort == nil {
-			portPath := appendPath(path, "patroni_port")
-			errs = append(errs, newValidationError(errors.New("patroni_port must be defined"), portPath))
+		if effectivePatroniPort == nil {
+			errs = append(errs, newValidationError(errors.New("patroni_port must be defined"), patroniPortPath))
 		}
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/internal/api/apiv1/validate.go` around lines 212 - 277, validateNode currently calls validatePorts(node.Port, node.PatroniPort, ...) and then separately checks db.Port/db.PatroniPort presence, which misses conflicts when a node inherits db-level ports; compute the effective ports by taking node.Port if non-nil else db.Port (and same for PatroniPort) inside validateNode, pass those effective values to validatePorts (so port pair conflict detection uses the runtime values), and use those same effective values when enforcing the "must be defined" checks for SystemD (replace db.Port==nil && node.Port==nil with effective postgres port == nil, and same for patroni_port) while keeping existing path/validation-error creation logic.
65-95: ⚠️ Potential issue | 🟠 Major

Also reject same-host port reuse across nodes for SystemD specs.

The current loop validates nodes independently, so a spec can still place two nodes on the same `host_id` and give them the same effective `port` or `patroni_port`. Under SystemD those listeners are bound directly on the host, so one of the instances will fail to start. This needs a per-host port-conflict check while iterating `spec.Nodes`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/internal/api/apiv1/validate.go` around lines 65 - 95, validateDatabaseSpec currently validates nodes independently and misses conflicts where two nodes on the same host (host_id) request the same effective port or patroni_port under SystemD; add a per-host port-usage tracker while iterating spec.Nodes (inside validateDatabaseSpec) that maps host_id -> set of used ports and patroni_ports, compute each node's effective port and patroni_port (use node.Port and node.PatroniPort / utils.FromPointer logic consistent with validatePorts), and when you see a port already used on the same host add a validation error via newValidationError (use the same nodePath like other node errors) describing the duplicate port; ensure this check only applies for SystemD orchestrator values and include both listener ports (port and patroni_port) so two nodes on the same host cannot reuse the same listener.
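The per-host check the comment asks for is not shown in a diff; a minimal sketch of the idea, using simplified stand-in types (`NodeSpec`, `effectivePort`, and `checkHostPortConflicts` are illustrative names, not the project's actual API), might look like:

```go
package main

import "fmt"

// NodeSpec is a simplified stand-in for the real node spec type;
// only the fields relevant to the port-conflict check are modeled.
type NodeSpec struct {
	HostID      string
	Port        *int
	PatroniPort *int
}

// effectivePort falls back to the database-level default when the
// node does not override the port.
func effectivePort(nodePort, dbPort *int) *int {
	if nodePort != nil {
		return nodePort
	}
	return dbPort
}

// checkHostPortConflicts returns an error for the first listener port
// (postgres or patroni) that is claimed twice on the same host.
func checkHostPortConflicts(nodes []NodeSpec, dbPort, dbPatroniPort *int) error {
	used := map[string]map[int]bool{} // host_id -> set of claimed ports
	for i, node := range nodes {
		if used[node.HostID] == nil {
			used[node.HostID] = map[int]bool{}
		}
		for _, p := range []*int{
			effectivePort(node.Port, dbPort),
			effectivePort(node.PatroniPort, dbPatroniPort),
		} {
			if p == nil {
				continue
			}
			if used[node.HostID][*p] {
				return fmt.Errorf("nodes[%d]: port %d already in use on host %s", i, *p, node.HostID)
			}
			used[node.HostID][*p] = true
		}
	}
	return nil
}

func main() {
	port, patroni := 5432, 8008
	nodes := []NodeSpec{
		{HostID: "h1"},
		{HostID: "h1"}, // same host, inherits the same defaults -> conflict
	}
	fmt.Println(checkHostPortConflicts(nodes, &port, &patroni))
}
```

Because the check runs on effective ports, it also catches the inheritance case: two nodes on one host that both fall back to the database-level defaults.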
🧹 Nitpick comments (13)
lima/roles/generate_test_config/tasks/main.yaml (1)
7-10: Use FQCN consistently for the `template` module.

Line 3 uses the fully qualified collection name (`ansible.builtin.file`), but line 8 uses the short form (`template`). For consistency and to avoid potential name collisions, use `ansible.builtin.template`.

💡 Suggested fix for FQCN consistency:

```diff
 - name: Generate test_config
-  template:
+  ansible.builtin.template:
     src: test_config.yaml.tmpl
     dest: ../e2e/fixtures/outputs/dev-lima.test_config.yaml
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lima/roles/generate_test_config/tasks/main.yaml` around lines 7 - 10, The task named "Generate test_config" uses the short module name "template"; change it to the fully-qualified collection name "ansible.builtin.template" to match the FQCN usage elsewhere and avoid collisions; update the template invocation in the "Generate test_config" task so its module is "ansible.builtin.template" while keeping the same src and dest parameters.

lima/roles/stop_dbs/tasks/main.yaml (1)
2-3: Add `changed_when` for accurate idempotency reporting.

The `command` module always reports `changed` status. Adding `changed_when: false` (since `reset-failed` is idempotent) improves playbook output accuracy and allows proper `--check` mode behavior.

💡 Suggested fix:

```diff
 - name: Reset failed services
   ansible.builtin.command: systemctl reset-failed
+  changed_when: false
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lima/roles/stop_dbs/tasks/main.yaml` around lines 2 - 3, The task named "Reset failed services" currently uses ansible.builtin.command: systemctl reset-failed which always reports changed; update that task to explicitly set changed_when: false so the task is idempotent in reporting (and behaves correctly in --check mode). Locate the task by the name "Reset failed services" and add changed_when: false under the task definition to prevent false changed events from this systemctl command.

lima/config.json (1)
7-9: Consider more verbose logging for development.

Setting the `api_server` log level to `error` may hide useful information when debugging during development. Typically dev configs benefit from `debug` or `info` level logging.

💡 Suggested change for more verbose dev logging:

```diff
   "component_levels": {
-    "api_server": "error"
+    "api_server": "debug"
   }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lima/config.json` around lines 7 - 9, The current development config sets component_levels.api_server to "error", which suppresses useful debug info; update the value for "api_server" under the "component_levels" object to a more verbose level such as "debug" or "info" for local/dev use (keep the key name "component_levels" and the "api_server" identifier unchanged) so that the API server emits more helpful logs during development; ensure this change is limited to dev configuration (not production) and restart the service to pick up the new level.

lima/roles/install_prerequisites/tasks/main.yaml (2)
13-17: GPG check disabled for pgEdge repository.

Disabling GPG verification (`disable_gpg_check: true`) reduces package integrity assurance. While acceptable for a development environment, consider enabling GPG checks if a signed repository becomes available.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lima/roles/install_prerequisites/tasks/main.yaml` around lines 13 - 17, The task labeled "Install pgEdge repository" currently sets disable_gpg_check: true which bypasses package signature verification; change this to enable GPG checks by removing or setting disable_gpg_check to false (or make it configurable via a variable like pgedge_disable_gpg_check with a safe default of false) so the package module performs signature validation; update any documentation/vars to explain toggling only for development and ensure the repository provides a signing key and that the playbook imports it before installation.
82-83: Delve installation task lacks idempotency.

The `go install` command will run on every playbook execution even if Delve is already installed. Consider adding a `creates` check or using a conditional.

💡 Proposed fix for idempotency:

```diff
 - name: Install delve debugger
-  ansible.builtin.command: /usr/local/go/bin/go install github.com/go-delve/delve/cmd/dlv@latest
+  ansible.builtin.command:
+    cmd: /usr/local/go/bin/go install github.com/go-delve/delve/cmd/dlv@latest
+    creates: /root/go/bin/dlv
```

Note: adjust the `creates` path based on the actual `GOBIN` or `GOPATH` location where `dlv` gets installed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lima/roles/install_prerequisites/tasks/main.yaml` around lines 82 - 83, The "Install delve debugger" Ansible task currently always runs `go install github.com/go-delve/delve/cmd/dlv@latest`; make it idempotent by adding a `creates` parameter (e.g. creates: /usr/local/go/bin/dlv or the actual GOBIN/GOPATH bin path such as "{{ lookup('env','GOBIN') | default((lookup('pipe','go env GOPATH') + '/bin')) }}/dlv") or wrap the command behind a conditional that checks for the `dlv` binary (e.g. `when: not lookup('ansible.builtin.file', '/usr/local/go/bin/dlv').exists` or use `stat` to set a fact). Update the task with one of these approaches so the `go install` only runs when `dlv` is missing.lima/run.sh (2)
lima/run.sh (2)

62-65: Stderr output won't be prefixed with host ID.

The `sed` filter only processes stdout. Any error output from `pgedge-control-plane` will appear without the `[host-X]` prefix, which could make debugging multi-host issues more difficult.

💡 Proposed fix to prefix both stdout and stderr:

```diff
 # The sed prefixes each output line with the host ID.
 /usr/sbin/pgedge-control-plane run \
 	--config-path ${LIMA_DIR}/config.json \
-	| sed "s/^/[${PGEDGE_HOST_ID}] /"
+	2>&1 | sed "s/^/[${PGEDGE_HOST_ID}] /"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lima/run.sh` around lines 62 - 65, Stderr from pgedge-control-plane is not being piped through sed so error lines lack the [${PGEDGE_HOST_ID}] prefix; update the invocation of pgedge-control-plane (the "pgedge-control-plane run --config-path ${LIMA_DIR}/config.json" command) to redirect stderr into stdout before the pipe (e.g., add "2>&1" or use the shell pipe-fail/operator that merges streams) so that sed receives both streams and prefixes error output as well.
45-48: Minor: trailing whitespace on line 46.

There's trailing whitespace after `instances` on line 46.

🧹 Proposed fix:

```diff
 export PGEDGE_DATA_DIR=${LIMA_DIR}/data/${PGEDGE_HOST_ID}
-export PGEDGE_SYSTEMD__INSTANCE_DATA_DIR=${LIMA_DIR}/data/${PGEDGE_HOST_ID}/instances 
-
+export PGEDGE_SYSTEMD__INSTANCE_DATA_DIR=${LIMA_DIR}/data/${PGEDGE_HOST_ID}/instances
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lima/run.sh` around lines 45 - 48, Remove the trailing whitespace at the end of the export for PGEDGE_SYSTEMD__INSTANCE_DATA_DIR: edit the export statement that assigns PGEDGE_SYSTEMD__INSTANCE_DATA_DIR (the line exporting ${LIMA_DIR}/data/${PGEDGE_HOST_ID}/instances) so it ends immediately after "instances" with no trailing spaces or tabs; this is a simple whitespace cleanup and does not change variable names or values.

docs/development/running-locally.md (1)
427-428: Minor: use "macOS" instead of "MacOS".

Apple's operating system should be written as "macOS" (lowercase "m", uppercase "OS").

📝 Proposed fix:

```diff
-Lima is an easy-to-use virtual machine runner that works well on MacOS.
+Lima is an easy-to-use virtual machine runner that works well on macOS.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/development/running-locally.md` around lines 427 - 428, Replace the incorrect "MacOS" casing with "macOS" in the documentation sentence that currently reads "Lima is an easy-to-use virtual machine runner that works well on MacOS." (search for that exact string or the sentence in docs/development/running-locally.md) and scan the surrounding text for any other occurrences of "MacOS" to correct them to "macOS" for consistent branding.

lima/Makefile (1)
29-30: Mark the `run` target as `.PHONY`.

The `run` target should be declared `.PHONY`, consistent with the other targets, to ensure it always executes regardless of whether a file named `run` exists.

💡 Proposed fix:

```diff
 .PHONY: fix-clocks
 fix-clocks:
 	ansible --become -i ./inventory.yaml all -m command -a 'chronyc -a makestep'
 
+.PHONY: run
 run: build fix-clocks
 	LIMA_DIR=$(shell pwd) circusd ./circus.ini
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lima/Makefile` around lines 29 - 30, Add the run target to the Makefile's phony declarations so it always executes regardless of a file named "run"; specifically, add a .PHONY declaration that includes the run target (e.g., add ".PHONY: run" alongside the other .PHONY entries) so the existing "run: build fix-clocks" rule will always run.

server/internal/orchestrator/systemd/provide.go (1)
26-32: Note: TODO for OS-specific package manager selection.

The hardcoded `&Dnf{}` with the TODO comment is appropriate for preliminary systemd support targeting RHEL-like systems. Consider tracking this as a follow-up issue if Debian/Ubuntu support is planned.

Would you like me to open an issue to track adding OS detection for package manager selection?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/internal/orchestrator/systemd/provide.go` around lines 26 - 32, The providePackageManager function currently returns a hardcoded &Dnf{}; implement OS detection and return the correct implementation: add a helper (e.g., isRHELLike or detectPackageManager) that checks /etc/os-release or uname and/or lsb_release, then in providePackageManager choose &Dnf{} for RHEL-like or &Apt{} (or DebianPackageManager) for Debian-like systems and return an error for unsupported OSes; if you prefer to defer, create a tracked follow-up issue and replace the hardcoded return with a clear TODO referencing providePackageManager and Dnf to ensure this is not forgotten.

server/internal/orchestrator/systemd/dnf.go (1)
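For the OS-detection follow-up, a rough sketch of the kind of helper the prompt describes, parsing `/etc/os-release`. The `detectPackageManager` name and string return are placeholders; the real code would return the project's package-manager types:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// detectPackageManager inspects os-release content and picks "dnf" for
// RHEL-like systems or "apt" for Debian-like ones. Parsing takes the
// file content as a string so it can be tested without the real file.
func detectPackageManager(osRelease string) (string, error) {
	ids := ""
	for _, line := range strings.Split(osRelease, "\n") {
		// Collect both ID= and ID_LIKE= values; ID_LIKE lists the
		// families a derivative distribution belongs to.
		if v, ok := strings.CutPrefix(line, "ID="); ok {
			ids += strings.Trim(v, `"`) + " "
		}
		if v, ok := strings.CutPrefix(line, "ID_LIKE="); ok {
			ids += strings.Trim(v, `"`) + " "
		}
	}
	switch {
	case strings.Contains(ids, "rhel"), strings.Contains(ids, "fedora"), strings.Contains(ids, "centos"):
		return "dnf", nil
	case strings.Contains(ids, "debian"), strings.Contains(ids, "ubuntu"):
		return "apt", nil
	}
	return "", fmt.Errorf("unsupported OS: %q", strings.TrimSpace(ids))
}

func main() {
	data, err := os.ReadFile("/etc/os-release")
	if err != nil {
		fmt.Println("no /etc/os-release:", err)
		return
	}
	fmt.Println(detectPackageManager(string(data)))
}
```

Checking `ID_LIKE` as well as `ID` is what makes derivatives such as Rocky or AlmaLinux resolve to the RHEL family without listing every distribution name.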
88-89: Consider expanding supported version lists as new releases ship.

The hardcoded `supportedPostgresVersions` and `supportedSpockVersions` will need updates when new major versions are released. Consider adding a comment noting this maintenance requirement, or potentially making these configurable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/internal/orchestrator/systemd/dnf.go` around lines 88 - 89, The two hardcoded version lists supportedPostgresVersions and supportedSpockVersions need maintenance when new majors ship; update the file to either (a) add a clear TODO comment above these variables explaining they must be updated on new releases, or (b) make them configurable (e.g., load from configuration or environment and fall back to the current literals) so new versions can be added without code changes; ensure you update any code that references supportedPostgresVersions and supportedSpockVersions to use the new configurable source or retain the literals as a default.

server/internal/orchestrator/systemd/pgbackrest_restore.go (1)
224-235: Minor: `afero.Fs` is invoked twice in the same `Create` call.

The filesystem is already obtained at lines 87-90 but is invoked again here. Consider passing `fs` as a parameter to `renameDataDir` to avoid the redundant DI lookup.

♻️ Proposed refactor:

```diff
-func (p *PgBackRestRestore) renameDataDir(rc *resource.Context) error {
-	fs, err := do.Invoke[afero.Fs](rc.Injector)
-	if err != nil {
-		return err
-	}
-
+func (p *PgBackRestRestore) renameDataDir(fs afero.Fs) error {
 	if err := fs.Rename(p.Paths.Instance.PgData(), p.Paths.Instance.PgDataRestore()); err != nil {
 		return fmt.Errorf("failed to rename pgdata for restore: %w", err)
 	}
 	return nil
 }
```

And update the caller:

```diff
-	err = p.renameDataDir(rc)
+	err = p.renameDataDir(fs)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/internal/orchestrator/systemd/pgbackrest_restore.go` around lines 224 - 235, The method PgBackRestRestore.renameDataDir currently performs a redundant dependency lookup by calling do.Invoke[afero.Fs](rc.Injector) again; change the signature to accept an afero.Fs parameter (e.g., renameDataDir(fs afero.Fs) error), remove the do.Invoke call inside renameDataDir, and use the passed fs to call Rename on p.Paths.Instance.PgData() -> p.Paths.Instance.PgDataRestore(); update all callers to obtain fs once via do.Invoke[afero.Fs](rc.Injector) (as already done earlier) and pass that fs into renameDataDir to avoid the duplicate DI lookup.

server/internal/orchestrator/systemd/unit.go (1)
84-121: Consider cleanup on partial failure in `Create`.

If `LinkUnit`, `Reload`, `EnableUnit`, or `RestartUnit` fails after the file is written (line 102), the unit file will exist on disk but the unit may be in an inconsistent state. The resource framework will likely retry `Create`, which should be idempotent, but consider whether explicit cleanup or rollback would improve reliability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/internal/orchestrator/systemd/unit.go` around lines 84 - 121, The Create method may leave the unit file on disk when any of client.LinkUnit, Reload, EnableUnit, or RestartUnit fails; modify UnitResource.Create to perform rollback on partial failure: after writing and closing f, if any of LinkUnit/Reload/EnableUnit/RestartUnit returns an error, attempt to undo by (a) calling a reverse client operation if available (e.g., client.UnlinkUnit(ctx, path) or client.DisableUnit/StopUnit), (b) calling client.Reload(ctx) again to apply the undo if applicable, and (c) removing the on-disk file with os.Remove(path) (ensure f is closed before removal); return the original operation error (or wrap it) and suppress/aggregate any cleanup errors. Locate this logic around the calls to LinkUnit, Reload, EnableUnit, RestartUnit in UnitResource.Create and add a small rollback helper or inline cleanup that uses the symbols client, path, and the error returned to perform the undo.
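No concrete rollback is proposed above; one possible shape for it, written against a hypothetical `unitClient` interface that only mirrors the calls named in the comment (the real systemd client's method set and signatures may differ). It assumes the unit file has already been written, and on any subsequent failure it best-effort disables the unit, removes the file, and reloads before surfacing the original error:

```go
package main

import "fmt"

// unitClient captures just the calls the Create path uses; this is an
// illustrative interface, not the project's actual client.
type unitClient interface {
	LinkUnit(path string) error
	EnableUnit(name string) error
	RestartUnit(name string) error
	DisableUnit(name string) error
	Reload() error
}

// createUnit links/enables/starts an already-written unit file. On any
// failure it rolls back so the host is not left with a half-configured
// unit; cleanup errors are deliberately dropped so the caller sees the
// original failure.
func createUnit(client unitClient, removeFile func() error, path, name string) (err error) {
	defer func() {
		if err == nil {
			return
		}
		_ = client.DisableUnit(name)
		_ = removeFile()
		_ = client.Reload()
	}()
	if err = client.LinkUnit(path); err != nil {
		return fmt.Errorf("link %s: %w", name, err)
	}
	if err = client.Reload(); err != nil {
		return fmt.Errorf("reload: %w", err)
	}
	if err = client.EnableUnit(name); err != nil {
		return fmt.Errorf("enable %s: %w", name, err)
	}
	if err = client.RestartUnit(name); err != nil {
		return fmt.Errorf("restart %s: %w", name, err)
	}
	return nil
}

// fakeClient lets the rollback path be exercised without systemd.
type fakeClient struct {
	failRestart bool
	log         []string
}

func (f *fakeClient) LinkUnit(string) error   { f.log = append(f.log, "link"); return nil }
func (f *fakeClient) EnableUnit(string) error { f.log = append(f.log, "enable"); return nil }
func (f *fakeClient) RestartUnit(string) error {
	f.log = append(f.log, "restart")
	if f.failRestart {
		return fmt.Errorf("boom")
	}
	return nil
}
func (f *fakeClient) DisableUnit(string) error { f.log = append(f.log, "disable"); return nil }
func (f *fakeClient) Reload() error            { f.log = append(f.log, "reload"); return nil }

func main() {
	c := &fakeClient{failRestart: true}
	removed := false
	err := createUnit(c, func() error { removed = true; return nil }, "/tmp/db.service", "db.service")
	fmt.Println(err, removed) // restart fails, so the unit file is removed again
}
```

The deferred rollback keyed on the named return keeps the happy path linear, and retries of `Create` still work because the rollback restores the pre-`Create` state.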
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@changes/unreleased/Added-20260309-165318.yaml`:
- Around line 1-3: Update the YAML 'body' value to use the official lowercase
project name by replacing "SystemD" with "systemd" in the Added entry (the body
field in changes/unreleased/Added-20260309-165318.yaml) so the text reads
"Preliminary support for managing database instances with systemd."
In `@e2e/fixture_test.go`:
- Around line 153-160: The code assumes ListHosts returned at least one host and
directly indexes hosts.Hosts[0], which can panic; update the logic after calling
cli.ListHosts(ctx) to check that hosts != nil and len(hosts.Hosts) > 0 (or
equivalent) before accessing hosts.Hosts[0], and if empty return a clear error
(e.g., "no hosts returned from ListHosts") instead of proceeding to assign
orchestrator; ensure this guard is applied where orchestrator :=
hosts.Hosts[0].Orchestrator is set.
In `@lima/lima-template.yaml`:
- Around line 15-16: The VM mount entry currently grants write access to the
entire host home via the YAML keys location: "~" and writable: true; change
writable: true to writable: false (or remove the writable key to rely on
read-only default) and instead add narrowly-scoped mount entries for any
specific paths that must be writable (e.g., explicit host paths under mounts
with writable: true), ensuring the location: "~" mount is read-only to avoid
exposing the whole host home.
In `@server/internal/config/config.go`:
- Around line 380-382: The error message in the default branch appends an error
labeled "host_type" while validating the Orchestrator field; update the message
to correctly reference the Orchestrator (e.g., "orchestrator: unsupported
orchestrator %q") so it uses c.Orchestrator and maintains consistent labeling in
the errs append call inside the same switch/default block in config.go.
In `@server/internal/ds/versions.go`:
- Around line 67-70: Guard MajorVersion() against nil/empty receiver and avoid
slice aliasing: if v is nil or has zero components return &Version{Components:
nil} (or an empty slice) instead of slicing into v.Components, and when
returning the first component use a defensive copy (e.g., use slices.Clone on
v.Components[:1] similar to Clone()) so the returned Version does not share
backing storage with v; update MajorVersion to check len(v.Components) and
return a cloned single-element slice.
In `@server/internal/monitor/instance_monitor.go`:
- Around line 104-108: When transitioning an instance to
database.InstanceStateAvailable you must also clear the persisted error so the
API doesn't surface stale failure messages; update the
m.dbSvc.UpdateInstanceState call (or switch to using updateInstanceStatus) so
the InstanceStateUpdateOptions for m.dbSvc.UpdateInstanceState includes clearing
the stored error (e.g., set the status.Error/instance.Error to nil/empty) when
State == database.InstanceStateAvailable; reference m.dbSvc.UpdateInstanceState,
database.InstanceStateUpdateOptions, updateInstanceStatus(), and
storedToInstance()/instance.Error to locate and ensure the persisted error is
cleared on recovery.
In `@server/internal/orchestrator/systemd/client.go`:
- Around line 92-100: The waitForPid function incorrectly treats non-ESRCH
errors (like EPERM) from os.Process.Signal(syscall.Signal(0)) as “process gone”
and it ignores caller cancellation; change waitForPid to accept a
context.Context parameter and inside its loop only consider the process gone
when Signal(0) returns an error that unwraps to syscall.ESRCH, treat EPERM (or
other permission errors) as the process still present, and select on ctx.Done()
as well as the timeout ticker so the function returns promptly when the caller
cancels; update the caller site that calls waitForPid (the block that currently
calls waitForPid(pid, stopTimeout)) to pass the existing context variable into
the new waitForPid(ctx, pid, stopTimeout) signature.
- Around line 57-74: The StartUnit method (and the analogous functions receiving
job results) currently reads the job result from resCh without checking its
value and blocks without honoring context cancellation; update StartUnit (and
the other unit-operation methods that read res := <-resCh) to use a select that
waits on either resCh or ctx.Done(), returning a context-aware error if ctx is
canceled, and validate the job result string (res) — treat any value other than
"done" (e.g. "failed", "canceled", "timeout", "dependency", "skipped") as an
operation error and return a formatted error including the response and unit
name; ensure you keep the existing pid/logging but return non-nil on non-"done"
responses.
In `@server/internal/orchestrator/systemd/orchestrator.go`:
- Around line 374-376: GenerateInstanceRestoreResources is mutating the caller's
InstanceSpec by setting spec.InPlaceRestore = true before calling
GenerateInstanceResources, which can break retries or later comparisons; fix it
by making a shallow copy (or deep copy if necessary) of the provided
InstanceSpec inside GenerateInstanceRestoreResources, set InPlaceRestore = true
on that copy, and pass the copy to GenerateInstanceResources so the original
spec pointer/values remain unchanged (reference:
GenerateInstanceRestoreResources, GenerateInstanceResources,
spec.InPlaceRestore).
- Around line 499-515: validatePort/checkPortAvailable currently allow 0 and
out-of-range values because checkPortAvailable binds an ephemeral port; update
validation to reject port 0 and any port outside 1-65535 before probing
availability. Specifically, in validatePort (and/or at the top of
checkPortAvailable) add a range check for the dereferenced port value and return
a clear error like "port must be between 1 and 65535" if out of range; only call
net.Listen to probe availability after the range check and keep the existing
ptrEqual behavior intact.
- Around line 260-275: The UnitResource patroniUnit is incorrectly using the
local host config (o.cfg.HostID) for HostID; change patroniUnit.HostID to use
spec.HostID so the resource is anchored to the instance identity. Locate the
patroniUnit construction (the UnitResource literal in orchestrator.go) and
replace the HostID assignment from o.cfg.HostID to spec.HostID to ensure the
unit is instance-scoped and not tied to the local config.
- Around line 361-363: The SystemD orchestrator currently returns
"unimplemented" from GenerateServiceInstanceResources and
GetServiceInstanceStatus which will crash workflows that unconditionally call
these (e.g., plan_update and ServiceInstanceMonitor); fix by adding an explicit
orchestrator-type guard before any service-instance use or by implementing these
functions for SystemD: update callers (plan_update, ServiceInstanceMonitor
construction) to check the orchestrator type on the given ServiceInstanceSpec
and skip/return a no-op when not supported, or implement
GenerateServiceInstanceResources and GetServiceInstanceStatus on Orchestrator to
handle SystemD service units (produce appropriate
database.ServiceInstanceResources from database.ServiceInstanceSpec and return
status), and ensure tests cover both gating and functional behavior for SystemD.
In `@server/internal/orchestrator/systemd/packages.go`:
- Around line 47-58: PackageCmp and InstalledPostgresCmp can panic on nil
fields; add defensive nil checks before dereferencing Version and Postgres: in
PackageCmp verify a and b are non-nil and that a.Version and b.Version are
non-nil and return a sensible ordering (e.g., treat nil < non-nil or equal) or
error value; in InstalledPostgresCmp verify a and b and their Postgres fields
and nested Postgres.Version are non-nil and apply the same ordering rule; update
callers or document behavior if you choose to treat nils as errors, and
reference the InstalledPackage, InstalledPostgres, PackageCmp and
InstalledPostgresCmp symbols when making the changes.
---
Outside diff comments:
In `@e2e/cancel_task_test.go`:
- Around line 48-55: The test contains redundant dead-code: after calling
require.NoError(t, err) which fails the test on error, the subsequent if err !=
nil { t.Logf(...); return } block is unreachable; remove that redundant error
check and logging (the variables creation_task := create_resp.Task and database
:= create_resp.Database can remain) so the test relies on require.NoError for
failure handling and does not contain the unused conditional that checks err.
In `@server/internal/api/apiv1/validate.go`:
- Around line 212-277: validateNode currently calls validatePorts(node.Port,
node.PatroniPort, ...) and then separately checks db.Port/db.PatroniPort
presence, which misses conflicts when a node inherits db-level ports; compute
the effective ports by taking node.Port if non-nil else db.Port (and same for
PatroniPort) inside validateNode, pass those effective values to validatePorts
(so port pair conflict detection uses the runtime values), and use those same
effective values when enforcing the "must be defined" checks for SystemD
(replace db.Port==nil && node.Port==nil with effective postgres port == nil, and
same for patroni_port) while keeping existing path/validation-error creation
logic.
- Around line 65-95: validateDatabaseSpec currently validates nodes
independently and misses conflicts where two nodes on the same host (host_id)
request the same effective port or patroni_port under SystemD; add a per-host
port-usage tracker while iterating spec.Nodes (inside validateDatabaseSpec) that
maps host_id -> set of used ports and patroni_ports, compute each node's
effective port and patroni_port (use node.Port and node.PatroniPort /
utils.FromPointer logic consistent with validatePorts), and when you see a port
already used on the same host add a validation error via newValidationError (use
the same nodePath like other node errors) describing the duplicate port; ensure
this check only applies for SystemD orchestrator values and include both
listener ports (port and patroni_port) so two nodes on the same host cannot
reuse the same listener.
In `@server/internal/database/spec.go`:
- Around line 554-580: InstanceSpec.Clone() currently omits the AllHostIDs field
so cloned specs lose peer host membership; update the returned struct in
InstanceSpec.Clone to include a cloned copy of s.AllHostIDs (e.g., deep-copy the
slice rather than assigning the original) so the new InstanceSpec preserves
AllHostIDs alongside the other cloned fields (use the project’s slice-clone
helper if present or create a new slice and copy elements) — modify the return
value in InstanceSpec.Clone to set AllHostIDs: <cloned slice of s.AllHostIDs>.
In `@server/internal/postgres/hba/entry.go`:
- Around line 63-67: The code currently always calls transformAddress(e.Address)
when building the hba entry (the variable entry) which normalizes bare IPs to
CIDR even when e.Mask is set, leading to mixed formats; change the logic in the
entry construction to use the raw e.Address (do not call transformAddress)
whenever e.Mask != "" so the address and separate mask remain unmodified, and
apply the same correction to the other HBA entry construction block that mirrors
this behavior (the block covering the subsequent entry building around the
transformAddress usage for e.Mask handling).
---
Nitpick comments:
In `@docs/development/running-locally.md`:
- Around line 427-428: Replace the incorrect "MacOS" casing with "macOS" in the
documentation sentence that currently reads "Lima is an easy-to-use virtual
machine runner that works well on MacOS." (search for that exact string or the
sentence in docs/development/running-locally.md) and scan the surrounding text
for any other occurrences of "MacOS" to correct them to "macOS" for consistent
branding.
In `@lima/config.json`:
- Around line 7-9: The current development config sets
component_levels.api_server to "error", which suppresses useful debug info;
update the value for "api_server" under the "component_levels" object to a more
verbose level such as "debug" or "info" for local/dev use (keep the key name
"component_levels" and the "api_server" identifier unchanged) so that the API
server emits more helpful logs during development; ensure this change is limited
to dev configuration (not production) and restart the service to pick up the new
level.
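A possible dev-only fragment for `lima/config.json` (the surrounding keys are omitted; only the `component_levels` object shown here is assumed):

```json
{
  "component_levels": {
    "api_server": "debug"
  }
}
```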
In `@lima/Makefile`:
- Around line 29-30: Add the run target to the Makefile's phony declarations so
it always executes regardless of a file named "run"; specifically, add a .PHONY
declaration that includes the run target (e.g., add ".PHONY: run" alongside the
other .PHONY entries) so the existing "run: build fix-clocks" rule will always
run.
In `@lima/roles/generate_test_config/tasks/main.yaml`:
- Around line 7-10: The task named "Generate test_config" uses the short module
name "template"; change it to the fully-qualified collection name
"ansible.builtin.template" to match the FQCN usage elsewhere and avoid
collisions—update the template invocation in the "Generate test_config" task so
its module is "ansible.builtin.template" while keeping the same src and dest
parameters.
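The FQCN form of the task would look roughly like this (the `src`/`dest` values are placeholders for the role's actual parameters):

```yaml
- name: Generate test_config
  ansible.builtin.template:
    src: test_config.yaml.tmpl          # placeholder: keep the role's real src
    dest: /path/to/test_config.yaml     # placeholder: keep the role's real dest
```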
In `@lima/roles/install_prerequisites/tasks/main.yaml`:
- Around line 13-17: The task labeled "Install pgEdge repository" currently sets
disable_gpg_check: true which bypasses package signature verification; change
this to enable GPG checks by removing or setting disable_gpg_check to false (or
make it configurable via a variable like pgedge_disable_gpg_check with a safe
default of false) so the package module performs signature validation; update
any documentation/vars to explain toggling only for development and ensure the
repository provides a signing key and that the playbook imports it before
installation.
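One way to make the check toggleable, assuming the task uses the `ansible.builtin.dnf` module (the variable name `pgedge_disable_gpg_check` comes from the review suggestion):

```yaml
- name: Install pgEdge repository
  ansible.builtin.dnf:
    name: https://dnf.pgedge.com/reporpm/pgedge-release-latest.noarch.rpm
    # Signature checks stay on by default; override only for development.
    disable_gpg_check: "{{ pgedge_disable_gpg_check | default(false) }}"
```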
- Around line 82-83: The "Install delve debugger" Ansible task currently always
runs `go install github.com/go-delve/delve/cmd/dlv@latest`; make it idempotent
by adding a `creates` parameter (e.g. creates: /usr/local/go/bin/dlv or the
actual GOBIN/GOPATH bin path such as "{{ lookup('env','GOBIN') |
default((lookup('pipe','go env GOPATH') + '/bin')) }}/dlv") or wrap the command
behind a conditional that checks for the `dlv` binary (e.g. `when: not
lookup('ansible.builtin.file', '/usr/local/go/bin/dlv').exists` or use `stat` to
set a fact). Update the task with one of these approaches so the `go install`
only runs when `dlv` is missing.
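The `creates` variant is the simplest of the suggested approaches (the binary path is an assumption and should match the VM's actual GOBIN):

```yaml
- name: Install delve debugger
  ansible.builtin.command:
    cmd: go install github.com/go-delve/delve/cmd/dlv@latest
    creates: /usr/local/go/bin/dlv   # assumed GOBIN path; skip task when dlv exists
```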
In `@lima/roles/stop_dbs/tasks/main.yaml`:
- Around line 2-3: The task named "Reset failed services" currently uses
ansible.builtin.command: systemctl reset-failed which always reports changed;
update that task to explicitly set changed_when: false so the task is idempotent
in reporting (and behaves correctly in --check mode). Locate the task by the
name "Reset failed services" and add changed_when: false under the task
definition to prevent false changed events from this systemctl command.
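The corrected task would look like:

```yaml
- name: Reset failed services
  ansible.builtin.command: systemctl reset-failed
  changed_when: false   # reset-failed is a no-op for reporting purposes
```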
In `@lima/run.sh`:
- Around line 62-65: Stderr from pgedge-control-plane is not being piped through
sed so error lines lack the [${PGEDGE_HOST_ID}] prefix; update the invocation of
pgedge-control-plane (the "pgedge-control-plane run --config-path
${LIMA_DIR}/config.json" command) to redirect stderr into stdout before the pipe
(e.g., add "2>&1" or use the shell pipe-fail/operator that merges streams) so
that sed receives both streams and prefixes error output as well.
- Around line 45-48: Remove the trailing whitespace at the end of the export for
PGEDGE_SYSTEMD__INSTANCE_DATA_DIR: edit the export statement that assigns
PGEDGE_SYSTEMD__INSTANCE_DATA_DIR (the line exporting
${LIMA_DIR}/data/${PGEDGE_HOST_ID}/instances) so it ends immediately after
"instances" with no trailing spaces or tabs; this is a simple whitespace cleanup
and does not change variable names or values.
In `@server/internal/orchestrator/systemd/dnf.go`:
- Around line 88-89: The two hardcoded version lists supportedPostgresVersions
and supportedSpockVersions need maintenance when new majors ship; update the
file to either (a) add a clear TODO comment above these variables explaining
they must be updated on new releases, or (b) make them configurable (e.g., load
from configuration or environment and fall back to the current literals) so new
versions can be added without code changes; ensure you update any code that
references supportedPostgresVersions and supportedSpockVersions to use the new
configurable source or retain the literals as a default.
In `@server/internal/orchestrator/systemd/pgbackrest_restore.go`:
- Around line 224-235: The method PgBackRestRestore.renameDataDir currently
performs a redundant dependency lookup by calling
do.Invoke[afero.Fs](rc.Injector) again; change the signature to accept an
afero.Fs parameter (e.g., renameDataDir(fs afero.Fs, rc *resource.Context)
error), remove the do.Invoke call inside renameDataDir, and use the passed fs to
call Rename on p.Paths.Instance.PgData() -> p.Paths.Instance.PgDataRestore();
update all callers to obtain fs once via do.Invoke[afero.Fs](rc.Injector) (as
already done earlier) and pass that fs into renameDataDir to avoid the duplicate
DI lookup.
In `@server/internal/orchestrator/systemd/provide.go`:
- Around line 26-32: The providePackageManager function currently returns a
hardcoded &Dnf{}; implement OS detection and return the correct implementation:
add a helper (e.g., isRHELLike or detectPackageManager) that checks
/etc/os-release or uname and/or lsb_release, then in providePackageManager
choose &Dnf{} for RHEL-like or &Apt{} (or DebianPackageManager) for Debian-like
systems and return an error for unsupported OSes; if you prefer to defer, create
a tracked follow-up issue and replace the hardcoded return with a clear TODO
referencing providePackageManager and Dnf to ensure this is not forgotten.
In `@server/internal/orchestrator/systemd/unit.go`:
- Around line 84-121: The Create method may leave the unit file on disk when any
of client.LinkUnit, Reload, EnableUnit, or RestartUnit fails; modify
UnitResource.Create to perform rollback on partial failure: after writing and
closing f, if any of LinkUnit/Reload/EnableUnit/RestartUnit returns an error,
attempt to undo by (a) calling a reverse client operation if available (e.g.,
client.UnlinkUnit(ctx, path) or client.DisableUnit/StopUnit), (b) calling
client.Reload(ctx) again to apply the undo if applicable, and (c) removing the
on-disk file with os.Remove(path) (ensure f is closed before removal); return
the original operation error (or wrap it) and suppress/aggregate any cleanup
errors. Locate this logic around the calls to LinkUnit, Reload, EnableUnit,
RestartUnit in UnitResource.Create and add a small rollback helper or inline
cleanup that uses the symbols client, path, and the error returned to perform
the undo.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4f75273a-59c4-4b48-b12b-634c6c4d1151
⛔ Files ignored due to path filters (8)
- `api/apiv1/gen/control_plane/service.go` is excluded by `!**/gen/**`
- `api/apiv1/gen/http/control_plane/client/types.go` is excluded by `!**/gen/**`
- `api/apiv1/gen/http/control_plane/server/types.go` is excluded by `!**/gen/**`
- `api/apiv1/gen/http/openapi.json` is excluded by `!**/gen/**`
- `api/apiv1/gen/http/openapi.yaml` is excluded by `!**/gen/**`
- `api/apiv1/gen/http/openapi3.json` is excluded by `!**/gen/**`
- `api/apiv1/gen/http/openapi3.yaml` is excluded by `!**/gen/**`
- `go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (74)
`.gitignore`, `Makefile`, `NOTICE.txt`, `api/apiv1/design/database.go`, `changes/unreleased/Added-20260309-165318.yaml`, `docs/development/running-locally.md`, `e2e/backup_restore_test.go`, `e2e/cancel_task_test.go`, `e2e/custom_db_create_test.go`, `e2e/db_create_test.go`, `e2e/db_update_add_node_test.go`, `e2e/failover_test.go`, `e2e/fixture_test.go`, `e2e/fixtures/ansible.cfg`, `e2e/load_test.go`, `e2e/minor_version_upgrade_test.go`, `e2e/service_provisioning_test.go`, `e2e/switchover_test.go`, `e2e/whole_cluster_test.go`, `go.mod`, `hack/dev-env.zsh`, `hack/pgedge-cp-env.plugin.zsh`, `lima/Makefile`, `lima/ansible.cfg`, `lima/circus.ini`, `lima/config.json`, `lima/deploy.yaml`, `lima/inventory.yaml`, `lima/lima-template.yaml`, `lima/roles/generate_test_config/tasks/main.yaml`, `lima/roles/generate_test_config/templates/test_config.yaml.tmpl`, `lima/roles/install_prerequisites/tasks/main.yaml`, `lima/roles/install_prerequisites/vars/main.yaml`, `lima/roles/start_vms/tasks/main.yaml`, `lima/roles/stop_dbs/tasks/main.yaml`, `lima/roles/teardown_vms/tasks/main.yaml`, `lima/run.sh`, `lima/stop-dbs.yaml`, `lima/teardown.yaml`, `lima/vars.yaml`, `server/cmd/root.go`, `server/internal/api/apiv1/convert.go`, `server/internal/api/apiv1/post_init_handlers.go`, `server/internal/api/apiv1/validate.go`, `server/internal/api/apiv1/validate_test.go`, `server/internal/app/app.go`, `server/internal/config/config.go`, `server/internal/database/instance.go`, `server/internal/database/instance_resource.go`, `server/internal/database/instance_store.go`, `server/internal/database/orchestrator.go`, `server/internal/database/service.go`, `server/internal/database/spec.go`, `server/internal/ds/versions.go`, `server/internal/monitor/instance_monitor.go`, `server/internal/orchestrator/common/golden_test/TestPatroniConfigGenerator/enable_fast_basebackup.yaml`, `server/internal/orchestrator/common/golden_test/TestPatroniConfigGenerator/minimal_systemd.yaml`, `server/internal/orchestrator/provide.go`, `server/internal/orchestrator/swarm/orchestrator.go`, `server/internal/orchestrator/systemd/client.go`, `server/internal/orchestrator/systemd/dnf.go`, `server/internal/orchestrator/systemd/orchestrator.go`, `server/internal/orchestrator/systemd/packages.go`, `server/internal/orchestrator/systemd/patroni_config.go`, `server/internal/orchestrator/systemd/patroni_unit.go`, `server/internal/orchestrator/systemd/pgbackrest_restore.go`, `server/internal/orchestrator/systemd/provide.go`, `server/internal/orchestrator/systemd/resources.go`, `server/internal/orchestrator/systemd/unit.go`, `server/internal/postgres/hba/entry.go`, `server/internal/workflows/activities/create_pgbackrest_backup.go`, `server/internal/workflows/activities/update_db_state.go`, `server/internal/workflows/activities/update_planned_instance_states.go`, `server/internal/workflows/delete_database.go`
Adds an orchestrator implementation for SystemD. As of this commit, you'll need to install Postgres on the host before starting the Control Plane:

```
dnf install -y epel-release dnf
dnf config-manager --set-enabled crb
dnf update -y --allowerasing
dnf install -y https://dnf.pgedge.com/reporpm/pgedge-release-latest.noarch.rpm
dnf install -y \
  pgedge-postgresql18 \
  pgedge-spock50_18 \
  pgedge-snowflake_18 \
  pgedge-lolor_18 \
  pgedge-postgresql18-contrib \
  pgedge-pgbackrest \
  pgedge-python3-psycopg2 \
  python3-pip
pip install 'patroni[etcd,jsonlogger]==4.1.0'
```

PLAT-417
Adds a development environment for the SystemD orchestrator, called `dev-lima`. Just like the `compose` environment, this environment exposes all services and databases on your host machine.
Adds support for running E2Es against the `dev-lima` environment. PLAT-417
Summary
Adds preliminary support for orchestrating pgEdge Enterprise Postgres databases using SystemD instead of Docker Swarm.
Testing
This PR includes a Lima-based development environment. See the changes in
`running-locally.md` for complete instructions. If you're already set up to run the Lima-based E2E fixtures, you can do:

Notes for Reviewers
As of this PR, the control-plane server does not respond to upgrades or other package changes that happen after the server starts. If you install a minor version update, you will be blocked from updating existing databases until you:
- update `postgres_version` in your database spec

This will change in a subsequent PR.
PLAT-417