
Migrate bootstrap/upgrade to unbounded agent library #153

Merged
bcho merged 32 commits into main from hbc/unbounded-agent
Apr 29, 2026

@bcho bcho commented Apr 24, 2026

Summary

Migrate the AKS FlexNode agent to use the github.com/Azure/unbounded agent library (v0.1.2) for bootstrap, upgrade, and drift remediation. This replaces the hand-rolled gRPC component API with the library's phases.Task / phases.Serial / phases.Parallel primitives, and completes a full migration from logrus to log/slog.

Changes

Dependency & library updates

  • Drop go-systemd/v22 (no longer imported)
  • Remove vendor directory; use module proxy

Architecture: gRPC components → phases.Task

  • Remove the gRPC action-hub component API (pkg/components/arc, components.go, wrappers.go, pkg/systemd)
  • Arc install/uninstall moved to pkg/arc as phases.Task implementations
  • NPD download and start moved to pkg/npd (installs into nspawn rootfs, not host)
  • CNI bridge config write moved to pkg/cni using utilio.WriteFile
  • AKS Machine registration moved to pkg/aksmachine
  • Daemon binary install moved to pkg/daemon
  • EnrichClusterConfig converted from phases.Task to a plain function (runs before goal-state resolution)

Bootstrapper simplification

  • Bootstrap uses a single phases.Serial call with parallel sub-groups:
    • Host prep: ConfigureOS, NFTables, DisableDocker, DisableSwap, HardenAPT, Arc install
    • Rootfs customisation: NPD download, daemon install, CNI config
  • Unbootstrap uses phases.Serial for StopNode + CleanupMachine, then best-effort Arc uninstall
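
The serial-with-parallel-sub-groups shape described above can be sketched roughly as follows. This is a minimal illustration, not the unbounded library's API: the `Task` interface, `TaskFunc`, `Serial`, `Parallel`, the `recorder` helper, and the `ProvisionRootfs` step name are all hypothetical stand-ins written for this example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Task is a hypothetical stand-in for the library's task abstraction.
type Task interface {
	Do(ctx context.Context) error
}

// TaskFunc adapts a plain function to Task.
type TaskFunc func(ctx context.Context) error

func (f TaskFunc) Do(ctx context.Context) error { return f(ctx) }

// Serial runs tasks in order and stops at the first error.
func Serial(tasks ...Task) Task {
	return TaskFunc(func(ctx context.Context) error {
		for _, t := range tasks {
			if err := t.Do(ctx); err != nil {
				return err
			}
		}
		return nil
	})
}

// Parallel runs tasks concurrently, waits for all of them, and joins their errors.
func Parallel(tasks ...Task) Task {
	return TaskFunc(func(ctx context.Context) error {
		errs := make([]error, len(tasks))
		var wg sync.WaitGroup
		for i, t := range tasks {
			wg.Add(1)
			go func(i int, t Task) {
				defer wg.Done()
				errs[i] = t.Do(ctx)
			}(i, t)
		}
		wg.Wait()
		return errors.Join(errs...) // nil when every task succeeded
	})
}

// recorder collects step names in completion order (illustration only).
type recorder struct {
	mu    sync.Mutex
	steps []string
}

func (r *recorder) step(name string) Task {
	return TaskFunc(func(context.Context) error {
		r.mu.Lock()
		defer r.mu.Unlock()
		r.steps = append(r.steps, name)
		return nil
	})
}

// runBootstrap wires host prep, rootfs provisioning, and rootfs customisation.
func runBootstrap() ([]string, error) {
	r := &recorder{}
	pipeline := Serial(
		Parallel(r.step("ConfigureOS"), r.step("NFTables"), r.step("DisableDocker"),
			r.step("DisableSwap"), r.step("HardenAPT"), r.step("ArcInstall")),
		r.step("ProvisionRootfs"), // hypothetical name for the rootfs provisioning step
		Parallel(r.step("DownloadNPD"), r.step("InstallDaemon"), r.step("WriteCNIConfig")),
	)
	err := pipeline.Do(context.Background())
	return r.steps, err
}

func main() {
	steps, err := runBootstrap()
	if err != nil {
		fmt.Println("bootstrap failed:", err)
		return
	}
	fmt.Println(steps)
}
```

Within each parallel group the completion order is not deterministic; only the ordering between groups is guaranteed by the outer serial call.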

Logging: logrus → log/slog

  • Full migration: all packages now use *slog.Logger instead of *logrus.Logger
  • pkg/logger: removed the exported LogLevel type and constants, ValidLogLevels, ValidateLogLevel, the context-key machinery, and LogLevelHelpers. SetupLogger is renamed to CreateLogger, which returns a *slog.Logger without mutating slog.Default()
  • Logger is explicitly passed everywhere; no context retrieval or slog.Default() fallbacks
  • Consistent variable naming for loggers: logger, not log
  • Dropped redundant "time" structured fields from daemon loop log calls (slog timestamps automatically)
  • Proper %w error wrapping throughout
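
A pure logger factory along these lines might look like the sketch below. The signature of `CreateLogger`, the accepted level names, the `errUnknownLevel` sentinel, and the choice of a text handler on stderr are assumptions made for illustration; the PR only states that `CreateLogger` returns a `*slog.Logger` without touching `slog.Default()` and that `parseLogLevel` is unexported.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
	"os"
	"strings"
)

// errUnknownLevel is a hypothetical sentinel used to demonstrate %w wrapping.
var errUnknownLevel = errors.New("unknown log level")

// parseLogLevel maps a level name to a slog.Level (accepted names are assumed).
func parseLogLevel(s string) (slog.Level, error) {
	switch strings.ToLower(s) {
	case "debug":
		return slog.LevelDebug, nil
	case "", "info":
		return slog.LevelInfo, nil
	case "warn", "warning":
		return slog.LevelWarn, nil
	case "error":
		return slog.LevelError, nil
	default:
		return 0, fmt.Errorf("%w: %q", errUnknownLevel, s)
	}
}

// CreateLogger builds a new logger and never calls slog.SetDefault.
func CreateLogger(level string) (*slog.Logger, error) {
	lvl, err := parseLogLevel(level)
	if err != nil {
		return nil, fmt.Errorf("create logger: %w", err)
	}
	handler := slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: lvl})
	return slog.New(handler), nil
}

func main() {
	before := slog.Default()
	logger, err := CreateLogger("debug")
	if err != nil {
		fmt.Println(err)
		return
	}
	// No explicit "time" field: slog adds the timestamp itself.
	logger.Info("daemon loop tick", "machine", "kube1")
	fmt.Println("default untouched:", slog.Default() == before)

	_, err = CreateLogger("bogus")
	fmt.Println("wrapped sentinel found:", errors.Is(err, errUnknownLevel))
}
```

Because the factory has no side effects on global state, callers have to pass the returned logger explicitly, which matches the "no slog.Default() fallbacks" rule above.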

Code cleanup

  • Remove dead code: RemoveDirectories, pkg/systemd
  • Remove unused buildARMClientOptions, unused params from buildExecCredential
  • Unexport parseLogLevel (only used within pkg/logger)
  • Add TODO comments for blue-green in-place upgrade (currently single machine slot kube1)

E2E & CI

  • E2E tests updated for nspawn-based log collection
  • Exec credential auth support for MSI/SP
  • Default bridge CNI config written into nspawn rootfs
  • Explicit permissions on release job

Key design decisions

  • Hard-coded single machine slot (kube1) with TODO for blue-green upgrade
  • NPD installs into nspawn rootfs; systemd managed via systemd-run --machine
  • CreateLogger is a pure factory — does not set global default
  • Interface-driven: depend on interfaces for testability (Arc, spec collector, node maintenance)

Closes #112 #113

Replace FlexNode-specific component executors (linux, cri, cni, kubebins,
kubelet, kubeadm) with shared tasks from github.com/Azure/unbounded.
Adopt nspawn-based kube1/kube2 alternating machine model.

- Add config adapter (ToAgentConfig) to map FlexNode config to AgentConfig
- Rewrite Bootstrapper to use phases.Serial/Parallel with shared library tasks
- Wrap Arc/NPD as phases.Task for integration with new orchestration
- Rewrite drift remediation upgrade path to use shared rootfs/nodestart tasks
- Update status collector for nspawn-aware command execution
- Delete 6 component directories now handled by shared library
- Exclude gosec from test files in golangci-lint config

The deleted component handlers (linux, cri, cni, kubebins, kubeadm) are no
longer registered in the action hub. Rewrite the kubeadm E2E test to use
the same config-based 'aks-flex-node agent' flow as the token/MSI tests.
MSI auth requires credential plugin support in the shared library which
is not yet implemented. Skip MSI join/unjoin/validate/smoke in E2E.

Add nspawn machine status and container kubelet log dumping to
_deploy_and_start_agent for better bootstrap debugging.
…nfig

CNI binaries are installed by the shared library bootstrap, but the CNI
conflist (10-unbounded.conflist) is written at runtime by the
unbounded-net-node DaemonSet which watches for podCIDR assignment. Until
this DaemonSet is deployed in the E2E cluster, nodes stay NetworkNotReady
and pods cannot be scheduled.

Node join/unjoin/rejoin all pass successfully.
bcho added 2 commits April 24, 2026 14:04
Update unbounded to f9e5ffb2b2ec which adds ExecCredential support in
the kubelet kubeconfig. Map FlexNode MSI and service principal configs
to exec credential plugins that invoke 'aks-flex-node token kubelogin'.
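
As a rough illustration of that mapping, the sketch below builds the `exec` stanza of a kubeconfig user entry that invokes `aks-flex-node token kubelogin`. The `ExecConfig` struct and `kubeloginExec` function are hypothetical, and the `apiVersion` and `interactiveMode` values are assumptions; the field names follow the standard kubeconfig exec plugin schema, not the unbounded library's types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ExecConfig mirrors the exec section of a kubeconfig user entry.
// It is a local stand-in, not a type from client-go or the unbounded library.
type ExecConfig struct {
	APIVersion      string   `json:"apiVersion"`
	Command         string   `json:"command"`
	Args            []string `json:"args,omitempty"`
	InteractiveMode string   `json:"interactiveMode,omitempty"`
}

// kubeloginExec returns an exec plugin entry that runs the given binary with
// the "token kubelogin" subcommand named in the commit message.
func kubeloginExec(binary string) ExecConfig {
	return ExecConfig{
		APIVersion:      "client.authentication.k8s.io/v1", // assumed API version
		Command:         binary,
		Args:            []string{"token", "kubelogin"},
		InteractiveMode: "Never", // kubelet cannot answer prompts
	}
}

func main() {
	out, err := json.MarshalIndent(kubeloginExec("aks-flex-node"), "", "  ")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(out))
}
```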

Always copy the aks-flex-node binary into the nspawn rootfs so it is
available inside the container for credential plugins and debugging.

Re-enable MSI node join/unjoin/validation in E2E tests.

machinectl shell doesn't produce output over non-TTY SSH sessions.
Use nsenter for journalctl and read CNI/containerd config directly
from /var/lib/machines/kube1/ rootfs.
…ests

The unbounded library installs CNI binaries but not a conflist. Without
one kubelet reports NetworkNotReady and pods cannot be scheduled.

Add a WriteCNIConfig task that writes the same 99-bridge.conf (embedded
via go:embed) that the old FlexNode CNI component used, into the nspawn
rootfs at /etc/cni/net.d/. Re-enable E2E smoke tests.
bcho added 2 commits April 24, 2026 15:39
Update github.com/Azure/unbounded to b8847fd8701b which adds CRI and
CNI version override fields to AgentConfig. Map FlexNode's containerd,
runc, and CNI plugin versions through the adapter; empty values fall
back to library defaults.
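
The pass-through-with-fallback behaviour can be pictured with the sketch below. All type, function, and field names here are hypothetical, and the default strings are placeholders rather than real versions; the actual AgentConfig fields and library defaults are not shown in this PR.

```go
package main

import "fmt"

// FlexVersions holds the user-facing overrides (hypothetical shape).
type FlexVersions struct {
	Containerd string
	Runc       string
	CNIPlugins string
}

// AgentVersions is a stand-in for the override fields on the library config.
type AgentVersions struct {
	Containerd string
	Runc       string
	CNIPlugins string
}

// toAgentVersions copies overrides through unchanged; empty means "not set".
func toAgentVersions(f FlexVersions) AgentVersions {
	return AgentVersions{Containerd: f.Containerd, Runc: f.Runc, CNIPlugins: f.CNIPlugins}
}

// libraryDefaults uses placeholder strings, not real version numbers.
var libraryDefaults = AgentVersions{
	Containerd: "<library default containerd>",
	Runc:       "<library default runc>",
	CNIPlugins: "<library default cni>",
}

func orDefault(v, def string) string {
	if v == "" {
		return def
	}
	return v
}

// resolve models the library side: empty overrides fall back to defaults.
func resolve(a AgentVersions) AgentVersions {
	return AgentVersions{
		Containerd: orDefault(a.Containerd, libraryDefaults.Containerd),
		Runc:       orDefault(a.Runc, libraryDefaults.Runc),
		CNIPlugins: orDefault(a.CNIPlugins, libraryDefaults.CNIPlugins),
	}
}

func main() {
	// Only runc is overridden; "1.2.3" is an illustrative value.
	got := resolve(toAgentVersions(FlexVersions{Runc: "1.2.3"}))
	fmt.Printf("%+v\n", got)
}
```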

Also replace the now-internal utilexec package with a local copy in
pkg/utils/utilexec.

Workaround until the library default is updated.

Remove protobuf/gRPC indirection from bootstrap tasks. DownloadNPD,
StartNPD, and InstallArc now implement phases.Task directly using
config fields instead of marshalling through protobuf actions and the
in-memory gRPC hub. This eliminates the grpc.ClientConn dependency
from the Bootstrapper struct and simplifies the bootstrap path.

The drift package still passes conn through its signatures; that
cleanup is tracked separately.

Delete the entire components/ directory (protobuf definitions, gRPC hub,
in-memory server, Arc/NPD/aksmachine action implementations) and the
apply CLI subcommand. All bootstrap tasks now use native phases.Task
implementations in pkg/bootstrapper/wrappers.go.

EnsureMachine (AKS Machines API registration) is converted to a native
task reading from config.Config and commented out in the bootstrap flow
until the Machines API is available in all target environments.

Also removes grpc.ClientConn threading from the drift package and all
daemon loop functions in commands.go, and cleans up unused utilpb
package and go.mod dependencies.
bcho added 7 commits April 27, 2026 10:51
Download binary and config into machineDir so they appear inside the
nspawn container. Start the systemd unit via systemd-run --machine
instead of the host D-Bus, matching the pattern used by containerd
and kubelet in the unbounded agent library.
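
A minimal sketch of building such a command is shown below. The `startInMachine` helper, the unit name, and the binary path are hypothetical; only `systemd-run` and its `--machine` and `--unit` options are taken as given. The command is constructed but not executed here.

```go
package main

import (
	"fmt"
	"os/exec"
)

// startInMachine builds a systemd-run invocation that asks the systemd
// instance inside the named nspawn machine to start a transient unit.
// The exact flags used by the agent are not shown in this PR; these are assumed.
func startInMachine(machine, unit, binary string, args ...string) *exec.Cmd {
	argv := []string{"--machine=" + machine, "--unit=" + unit, binary}
	argv = append(argv, args...)
	return exec.Command("systemd-run", argv...)
}

func main() {
	// Unit name and binary path are illustrative placeholders.
	cmd := startInMachine("kube1", "node-problem-detector", "/usr/local/bin/node-problem-detector")
	fmt.Println(cmd.Args)
}
```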

Removes dependency on pkg/systemd.Manager for NPD.

Extract CNI bridge config task and its embedded asset from
pkg/bootstrapper/wrappers.go into a dedicated pkg/cni package,
following the same pattern as pkg/arc and pkg/npd.

Also removes stale pkg/bootstrapper/assets/node-problem-detector.service
left over from the earlier NPD move.

utilio.WriteFile handles directory creation and atomic writes, so
remove the manual MkdirAll and content equality check.
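
For readers unfamiliar with the pattern, the sketch below shows one common way to get "create parent directories, then write atomically" using only the standard library. It is not the implementation of utilio.WriteFile; `writeFileAtomic` and `writeAndReadBack` are hypothetical helpers, and the temporary directory stands in for the nspawn rootfs.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic creates the parent directory, writes to a temp file in the
// same directory, and renames it over the target so readers never see a
// partially written file.
func writeFileAtomic(path string, data []byte, perm os.FileMode) error {
	dir := filepath.Dir(path)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return fmt.Errorf("create %s: %w", dir, err)
	}
	tmp, err := os.CreateTemp(dir, ".tmp-*")
	if err != nil {
		return fmt.Errorf("create temp file: %w", err)
	}
	defer os.Remove(tmp.Name()) // fails harmlessly once the rename has succeeded

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return fmt.Errorf("write temp file: %w", err)
	}
	if err := tmp.Chmod(perm); err != nil {
		tmp.Close()
		return fmt.Errorf("chmod temp file: %w", err)
	}
	if err := tmp.Close(); err != nil {
		return fmt.Errorf("close temp file: %w", err)
	}
	if err := os.Rename(tmp.Name(), path); err != nil {
		return fmt.Errorf("rename into place: %w", err)
	}
	return nil
}

// writeAndReadBack writes content under a throwaway root and reads it back.
func writeAndReadBack(content string) (string, error) {
	root, err := os.MkdirTemp("", "rootfs-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(root)

	target := filepath.Join(root, "etc", "cni", "net.d", "99-bridge.conf")
	if err := writeFileAtomic(target, []byte(content), 0o644); err != nil {
		return "", err
	}
	got, err := os.ReadFile(target)
	return string(got), err
}

func main() {
	got, err := writeAndReadBack("{}\n")
	if err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Printf("read back %d bytes\n", len(got))
}
```

With a helper of this kind, a separate MkdirAll call and a compare-before-write check become redundant, which is the simplification the commit describes.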
Extract AKS Machines API registration task and helpers from
pkg/bootstrapper/wrappers.go into a dedicated pkg/aksmachine package,
continuing the pattern of one package per Azure-specific component.

- Convert EnrichClusterConfig from phases.Task to a plain function
  called directly before goal state resolution.
- Remove logrus bridge (logger.go) — no longer needed.
- Merge bootstrap task list into single phases.Serial call.
- Run Arc install in parallel with other host preparation steps.
- Use phases.Serial for unbootstrap critical path (StopNode,
  CleanupMachine) with best-effort Arc uninstall afterwards.
- Use b.cfg/b.logger directly instead of local aliases.
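
The "critical path first, best-effort Arc uninstall afterwards" flow can be sketched as below. The `ArcUninstaller` interface, `failingArc` fake, `step` type, and `unbootstrap` function are hypothetical names for illustration; the real bootstrapper uses phases.Serial for the critical steps.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"log/slog"
)

// ArcUninstaller is a hypothetical interface so tests can swap in a fake.
type ArcUninstaller interface {
	Uninstall(ctx context.Context) error
}

// failingArc is a test double whose uninstall always fails.
type failingArc struct{}

func (failingArc) Uninstall(context.Context) error {
	return errors.New("arc agent not reachable")
}

// step is one critical unbootstrap action, such as StopNode or CleanupMachine.
type step func(ctx context.Context) error

// unbootstrap runs critical steps in order and fails fast on error, then
// attempts the Arc uninstall and only logs if that attempt fails.
func unbootstrap(ctx context.Context, logger *slog.Logger, arc ArcUninstaller, critical ...step) error {
	for _, s := range critical {
		if err := s(ctx); err != nil {
			return fmt.Errorf("unbootstrap: %w", err)
		}
	}
	if err := arc.Uninstall(ctx); err != nil {
		logger.Warn("arc uninstall failed, continuing", "error", err)
	}
	return nil
}

// runWithFailingArc exercises the best-effort path with no-op critical steps.
func runWithFailingArc() error {
	logger := slog.New(slog.NewTextHandler(io.Discard, nil))
	noop := func(context.Context) error { return nil }
	return unbootstrap(context.Background(), logger, failingArc{}, noop, noop)
}

func main() {
	fmt.Println("unbootstrap error:", runWithFailingArc())
}
```

Depending on an interface rather than a concrete Arc client is what makes the failing case easy to exercise without a real Arc agent.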
bcho added 5 commits April 27, 2026 13:01
Extract InstallBinary task into pkg/daemon/install.go. Run NPD
download, daemon binary install, and CNI config write in parallel
after rootfs provisioning. Delete empty components.go and now-empty
wrappers.go.

Document that bootstrapper and drift remediation currently hard-code
a single machine slot (kube1) and will need to manage two machine
names once blue-green upgrades are implemented.

RemoveDirectories in pkg/utils had no callers. pkg/systemd had no
importers after NPD was moved to use systemd-run --machine. go mod
tidy drops the now-unused go-systemd/v22 dependency.
@bcho bcho marked this pull request as ready for review April 27, 2026 21:32
Copilot AI review requested due to automatic review settings April 27, 2026 21:32

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

Comment thread pkg/aksmachine/ensure.go Dismissed
Comment thread pkg/aksmachine/ensure.go Dismissed
Comment thread pkg/aksmachine/ensure.go Dismissed
Comment thread pkg/arc/arc_installer.go Dismissed
Comment thread pkg/arc/arc_installer.go Dismissed
Comment thread pkg/daemon/install.go Dismissed
Comment thread pkg/daemon/install.go Dismissed
Comment thread pkg/daemon/install.go Dismissed
Comment thread pkg/npd/start.go Dismissed
Comment thread pkg/utils/utilexec/exec.go Dismissed
@bcho bcho merged commit 96145aa into main Apr 29, 2026
8 of 9 checks passed
@bcho bcho deleted the hbc/unbounded-agent branch April 29, 2026 18:36
Development

Successfully merging this pull request may close these issues.

Install crictl on the host

4 participants