Migrate bootstrap/upgrade to unbounded agent library#153
Merged
Replace FlexNode-specific component executors (linux, cri, cni, kubebins, kubelet, kubeadm) with shared tasks from github.com/Azure/unbounded. Adopt nspawn-based kube1/kube2 alternating machine model.
- Add config adapter (ToAgentConfig) to map FlexNode config to AgentConfig
- Rewrite Bootstrapper to use phases.Serial/Parallel with shared library tasks
- Wrap Arc/NPD as phases.Task for integration with new orchestration
- Rewrite drift remediation upgrade path to use shared rootfs/nodestart tasks
- Update status collector for nspawn-aware command execution
- Delete 6 component directories now handled by shared library
- Exclude gosec from test files in golangci-lint config
The deleted component handlers (linux, cri, cni, kubebins, kubeadm) are no longer registered in the action hub. Rewrite the kubeadm E2E test to use the same config-based 'aks-flex-node agent' flow as the token/MSI tests.
MSI auth requires credential plugin support in the shared library, which is not yet implemented. Skip MSI join/unjoin/validate/smoke in E2E. Add nspawn machine status and container kubelet log dumping to _deploy_and_start_agent for better bootstrap debugging.
…nfig CNI binaries are installed by the shared library bootstrap, but the CNI conflist (10-unbounded.conflist) is written at runtime by the unbounded-net-node DaemonSet which watches for podCIDR assignment. Until this DaemonSet is deployed in the E2E cluster, nodes stay NetworkNotReady and pods cannot be scheduled. Node join/unjoin/rejoin all pass successfully.
Update unbounded to f9e5ffb2b2ec which adds ExecCredential support in the kubelet kubeconfig. Map FlexNode MSI and service principal configs to exec credential plugins that invoke 'aks-flex-node token kubelogin'. Always copy the aks-flex-node binary into the nspawn rootfs so it is available inside the container for credential plugins and debugging. Re-enable MSI node join/unjoin/validation in E2E tests.
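For context on the exec credential plugin mentioned above: kubelet runs the configured command and expects a client.authentication.k8s.io/v1 ExecCredential object on stdout. The following is a minimal sketch of that output shape only. The function names, the token value, and the expiry are illustrative assumptions, and nothing here is taken from the actual 'aks-flex-node token kubelogin' implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// execCredential mirrors the client.authentication.k8s.io/v1 ExecCredential
// object that an exec credential plugin prints on stdout.
type execCredential struct {
	APIVersion string               `json:"apiVersion"`
	Kind       string               `json:"kind"`
	Status     execCredentialStatus `json:"status"`
}

type execCredentialStatus struct {
	Token               string `json:"token"`
	ExpirationTimestamp string `json:"expirationTimestamp,omitempty"`
}

// renderExecCredential is a hypothetical helper: it serializes a bearer token
// and its expiry into the JSON document the kubelet expects from the plugin.
func renderExecCredential(token string, expiry time.Time) (string, error) {
	out, err := json.Marshal(execCredential{
		APIVersion: "client.authentication.k8s.io/v1",
		Kind:       "ExecCredential",
		Status: execCredentialStatus{
			Token:               token,
			ExpirationTimestamp: expiry.UTC().Format(time.RFC3339),
		},
	})
	if err != nil {
		return "", fmt.Errorf("marshal ExecCredential: %w", err)
	}
	return string(out), nil
}

// execCredentialDemo renders a credential with placeholder values.
func execCredentialDemo() (string, error) {
	return renderExecCredential("example-token", time.Date(2026, 4, 29, 0, 0, 0, 0, time.UTC))
}

func main() {
	out, err := execCredentialDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

A real plugin would obtain the token from MSI or a service principal before printing; that part is deliberately left out of the sketch.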
machinectl shell doesn't produce output over non-TTY SSH sessions. Use nsenter for journalctl and read CNI/containerd config directly from /var/lib/machines/kube1/ rootfs.
…ests The unbounded library installs CNI binaries but not a conflist. Without one, kubelet reports NetworkNotReady and pods cannot be scheduled. Add a WriteCNIConfig task that writes the same 99-bridge.conf (embedded via go:embed) that the old FlexNode CNI component used into the nspawn rootfs at /etc/cni/net.d/. Re-enable E2E smoke tests.
Update github.com/Azure/unbounded to b8847fd8701b which adds CRI and CNI version override fields to AgentConfig. Map FlexNode's containerd, runc, and CNI plugin versions through the adapter; empty values fall back to library defaults. Also replace the now-internal utilexec package with a local copy in pkg/utils/utilexec.
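The "empty values fall back to library defaults" behaviour can be sketched as follows. This is an illustration only: the struct names, field names, and default version strings are hypothetical stand-ins, not the real FlexNode config or the library's AgentConfig.

```go
package main

import "fmt"

// flexNodeVersions is a hypothetical stand-in for the version fields in the
// FlexNode config.
type flexNodeVersions struct {
	Containerd string
	Runc       string
	CNIPlugins string
}

// agentOverrides is a hypothetical stand-in for the override fields added to
// AgentConfig. An empty string means "no override".
type agentOverrides struct {
	ContainerdVersion string
	RuncVersion       string
	CNIPluginVersion  string
}

// toOverrides passes versions through unchanged, so an unset FlexNode value
// stays empty and the library keeps its own default.
func toOverrides(v flexNodeVersions) agentOverrides {
	return agentOverrides{
		ContainerdVersion: v.Containerd,
		RuncVersion:       v.Runc,
		CNIPluginVersion:  v.CNIPlugins,
	}
}

// resolveVersion models the library side of the contract: use the override
// when present, otherwise the built-in default.
func resolveVersion(override, libraryDefault string) string {
	if override != "" {
		return override
	}
	return libraryDefault
}

func main() {
	o := toOverrides(flexNodeVersions{Containerd: "1.7.20"}) // runc and CNI left unset
	// The defaults below are placeholder values, not the library's real ones.
	fmt.Println("containerd:", resolveVersion(o.ContainerdVersion, "2.0.0"))
	fmt.Println("runc:", resolveVersion(o.RuncVersion, "1.2.0"))
	fmt.Println("cni:", resolveVersion(o.CNIPluginVersion, "1.5.0"))
}
```

Keeping the adapter a plain pass-through means the fallback logic lives in one place, the library, rather than being duplicated in FlexNode.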
Workaround until the library default is updated.
Remove protobuf/gRPC indirection from bootstrap tasks. DownloadNPD, StartNPD, and InstallArc now implement phases.Task directly using config fields instead of marshalling through protobuf actions and the in-memory gRPC hub. This eliminates the grpc.ClientConn dependency from the Bootstrapper struct and simplifies the bootstrap path. The drift package still passes conn through its signatures; that cleanup is tracked separately.
Delete the entire components/ directory (protobuf definitions, gRPC hub, in-memory server, Arc/NPD/aksmachine action implementations) and the apply CLI subcommand. All bootstrap tasks now use native phases.Task implementations in pkg/bootstrapper/wrappers.go. EnsureMachine (AKS Machines API registration) is converted to a native task reading from config.Config and commented out in the bootstrap flow until the Machines API is available in all target environments. Also removes grpc.ClientConn threading from the drift package and all daemon loop functions in commands.go, and cleans up unused utilpb package and go.mod dependencies.
Download binary and config into machineDir so they appear inside the nspawn container. Start the systemd unit via systemd-run --machine instead of the host D-Bus, matching the pattern used by containerd and kubelet in the unbounded agent library. Removes dependency on pkg/systemd.Manager for NPD.
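As a rough illustration of the systemd-run --machine pattern, the sketch below only builds the command line and does not execute it. The helper name, unit name, and binary path are assumptions made for the example; the exact arguments used in this PR may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// machineRunArgs is a hypothetical helper that builds the argv for starting a
// transient unit inside a named nspawn machine instead of on the host.
// --machine= and --unit= are standard systemd-run options.
func machineRunArgs(machine, unit string, command ...string) []string {
	args := []string{"systemd-run", "--machine=" + machine, "--unit=" + unit}
	return append(args, command...)
}

func main() {
	// The unit name and binary path are placeholders for illustration.
	args := machineRunArgs("kube1", "node-problem-detector.service",
		"/usr/local/bin/node-problem-detector")
	fmt.Println(strings.Join(args, " "))
}
```

Because the binary and config are downloaded into machineDir, the path passed to the command is resolved inside the container's rootfs, not the host's.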
Extract CNI bridge config task and its embedded asset from pkg/bootstrapper/wrappers.go into a dedicated pkg/cni package, following the same pattern as pkg/arc and pkg/npd. Also removes stale pkg/bootstrapper/assets/node-problem-detector.service left over from the earlier NPD move.
utilio.WriteFile handles directory creation and atomic writes, so remove the manual MkdirAll and content equality check.
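For readers unfamiliar with the pattern, a write that creates parent directories and replaces the target atomically can be sketched like this. The function below is a hypothetical stand-in written for illustration; it is not the source of utilio.WriteFile, whose signature and behaviour may differ, and the file content is a placeholder.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic is a hypothetical stand-in: it creates the parent directory,
// writes to a temp file in the same directory, then renames it over the
// target so readers never observe a partially written file.
func writeFileAtomic(path string, data []byte, perm os.FileMode) error {
	dir := filepath.Dir(path)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return fmt.Errorf("create %s: %w", dir, err)
	}
	tmp, err := os.CreateTemp(dir, ".tmp-*")
	if err != nil {
		return fmt.Errorf("create temp file: %w", err)
	}
	// Cleans up on failure; after a successful rename the temp name no longer
	// exists and the error from Remove is ignored.
	defer os.Remove(tmp.Name())
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return fmt.Errorf("write temp file: %w", err)
	}
	if err := tmp.Chmod(perm); err != nil {
		tmp.Close()
		return fmt.Errorf("chmod temp file: %w", err)
	}
	if err := tmp.Close(); err != nil {
		return fmt.Errorf("close temp file: %w", err)
	}
	if err := os.Rename(tmp.Name(), path); err != nil {
		return fmt.Errorf("rename into place: %w", err)
	}
	return nil
}

// writeAndReadBack writes a placeholder conflist under a temp root and returns
// what ends up on disk.
func writeAndReadBack() (string, error) {
	root, err := os.MkdirTemp("", "cni-sketch-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(root)
	path := filepath.Join(root, "etc", "cni", "net.d", "99-bridge.conf")
	if err := writeFileAtomic(path, []byte(`{"name":"bridge"}`), 0o644); err != nil {
		return "", err
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	got, err := writeAndReadBack()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(got)
}
```

With a helper like this, callers need neither a manual MkdirAll nor a content-equality check. The sketch does not call fsync, so it makes no durability guarantee across a crash.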
Extract AKS Machines API registration task and helpers from pkg/bootstrapper/wrappers.go into a dedicated pkg/aksmachine package, continuing the pattern of one package per Azure-specific component.
- Convert EnrichClusterConfig from phases.Task to a plain function called directly before goal state resolution.
- Remove logrus bridge (logger.go), which is no longer needed.
- Merge bootstrap task list into single phases.Serial call.
- Run Arc install in parallel with other host preparation steps.
- Use phases.Serial for unbootstrap critical path (StopNode, CleanupMachine) with best-effort Arc uninstall afterwards.
- Use b.cfg/b.logger directly instead of local aliases.
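The shape of a single serial plan with parallel sub-groups can be sketched as below. The Task interface, the serial and parallel types, and every step name are local, hypothetical stand-ins; the real phases.Task, phases.Serial, and phases.Parallel in github.com/Azure/unbounded may have different signatures.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Task is a hypothetical stand-in for the library's task abstraction.
type Task interface {
	Name() string
	Do(ctx context.Context) error
}

type funcTask struct {
	name string
	fn   func(ctx context.Context) error
}

func (t funcTask) Name() string                 { return t.name }
func (t funcTask) Do(ctx context.Context) error { return t.fn(ctx) }

// serial runs tasks in order and stops at the first failure.
type serial []Task

func (s serial) Name() string { return "serial" }
func (s serial) Do(ctx context.Context) error {
	for _, t := range s {
		if err := t.Do(ctx); err != nil {
			return fmt.Errorf("%s: %w", t.Name(), err)
		}
	}
	return nil
}

// parallel runs tasks concurrently, waits for all, and joins their errors.
type parallel []Task

func (p parallel) Name() string { return "parallel" }
func (p parallel) Do(ctx context.Context) error {
	errs := make([]error, len(p))
	var wg sync.WaitGroup
	for i, t := range p {
		wg.Add(1)
		go func(i int, t Task) {
			defer wg.Done()
			if err := t.Do(ctx); err != nil {
				errs[i] = fmt.Errorf("%s: %w", t.Name(), err)
			}
		}(i, t)
	}
	wg.Wait()
	return errors.Join(errs...) // nil when every entry is nil
}

// recorder notes the order in which steps complete.
type recorder struct {
	mu   sync.Mutex
	done []string
}

func (r *recorder) step(name string) Task {
	return funcTask{name, func(context.Context) error {
		r.mu.Lock()
		defer r.mu.Unlock()
		r.done = append(r.done, name)
		return nil
	}}
}

// runBootstrapSketch composes illustrative steps into one serial plan.
func runBootstrapSketch() ([]string, error) {
	r := &recorder{}
	plan := serial{
		parallel{r.step("install-arc"), r.step("prepare-host")},
		r.step("provision-rootfs"),
		parallel{r.step("download-npd"), r.step("install-binary"), r.step("write-cni-config")},
		r.step("start-node"),
	}
	err := plan.Do(context.Background())
	return r.done, err
}

func main() {
	order, err := runBootstrapSketch()
	fmt.Println(order, err)
}
```

Order inside a parallel group is not deterministic, but each group finishes before the next serial step starts, which is the property the bootstrap flow relies on.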
Extract InstallBinary task into pkg/daemon/install.go. Run NPD download, daemon binary install, and CNI config write in parallel after rootfs provisioning. Delete empty components.go and now-empty wrappers.go.
Document that bootstrapper and drift remediation currently hard-code a single machine slot (kube1) and will need to manage two machine names once blue-green upgrades are implemented.
RemoveDirectories in pkg/utils had no callers. pkg/systemd had no importers after NPD was moved to use systemd-run --machine. go mod tidy drops the now-unused go-systemd/v22 dependency.
…s, wrap errors with %w
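A short sketch of what wrapping with %w buys the caller: the added context does not hide the underlying error from errors.Is. The function names and the file path are made up for the example.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// readConfig is a hypothetical helper that adds context with %w so the
// original error stays inspectable by callers.
func readConfig(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("read config %q: %w", path, err)
	}
	return data, nil
}

// missingConfigIsDetectable reports whether a wrapped "file not found" error
// can still be matched against fs.ErrNotExist.
func missingConfigIsDetectable() bool {
	_, err := readConfig("/nonexistent/aks-flex-node/config.json")
	return errors.Is(err, fs.ErrNotExist)
}

func main() {
	fmt.Println("wrapped error still matches fs.ErrNotExist:", missingConfigIsDetectable())
}
```

Had the wrap used %v instead, the message would look the same but the errors.Is check would fail.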
wenxuan0923 approved these changes on Apr 29, 2026
Summary
Migrate the AKS FlexNode agent to use the github.com/Azure/unbounded agent library (v0.1.2) for bootstrap, upgrade, and drift remediation. This replaces the hand-rolled gRPC component API with the library's phases.Task/phases.Serial/phases.Parallel primitives, and completes a full migration from logrus to log/slog.
Changes
Dependency & library updates
- … go-systemd/v22 (no longer imported)
Architecture: gRPC components → phases.Task
- … pkg/components/arc, components.go, wrappers.go, pkg/systemd)
- … pkg/arc as phases.Task implementations
- … pkg/npd (installs into nspawn rootfs, not host)
- … pkg/cni using utilio.WriteFile
- … pkg/aksmachine
- … pkg/daemon
- EnrichClusterConfig converted from phases.Task to a plain function (runs before goal-state resolution)
Bootstrapper simplification
- … phases.Serial call with parallel sub-groups:
- … phases.Serial for StopNode + CleanupMachine, then best-effort Arc uninstall
Logging: logrus → log/slog
- … *slog.Logger instead of *logrus.Logger
- pkg/logger: removed exported LogLevel type/consts, ValidLogLevels, ValidateLogLevel, context-key machinery, LogLevelHelpers. SetupLogger renamed to CreateLogger, which returns *slog.Logger without mutating slog.Default()
- … slog.Default() fallbacks
- … logger variable naming (not log)
- … "time" structured fields from daemon loop log calls (slog timestamps automatically)
- … %w error wrapping throughout
Code cleanup
- … RemoveDirectories, pkg/systemd
- … buildARMClientOptions, unused params from buildExecCredential
- … parseLogLevel (only used within pkg/logger)
- … kube1)
E2E & CI
Key design decisions
- … kube1) with TODO for blue-green upgrade
- … systemd-run --machine
- CreateLogger is a pure factory: it does not set the global default
Closes #112 #113
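The "pure factory" decision for CreateLogger can be illustrated with a small sketch. The createLogger function below is a hypothetical stand-in with an assumed signature, not the code in pkg/logger; it only shows a constructor that returns a *slog.Logger and leaves slog.Default() untouched.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log/slog"
	"strings"
)

// createLogger is a hypothetical factory: it parses the level, builds a
// logger for the given writer, and never calls slog.SetDefault.
func createLogger(w io.Writer, level string) (*slog.Logger, error) {
	var lvl slog.Level
	if err := lvl.UnmarshalText([]byte(level)); err != nil {
		return nil, fmt.Errorf("parse log level %q: %w", level, err)
	}
	return slog.New(slog.NewTextHandler(w, &slog.HandlerOptions{Level: lvl})), nil
}

// loggerFactoryDemo reports whether the new logger wrote the record and
// whether the process-wide default logger was left unchanged.
func loggerFactoryDemo() (logged, defaultUnchanged bool, err error) {
	before := slog.Default()
	var buf bytes.Buffer
	logger, err := createLogger(&buf, "debug")
	if err != nil {
		return false, false, err
	}
	logger.Debug("node joined", "machine", "kube1")
	logged = strings.Contains(buf.String(), "machine=kube1")
	return logged, slog.Default() == before, nil
}

func main() {
	logged, unchanged, err := loggerFactoryDemo()
	fmt.Println(logged, unchanged, err)
}
```

Passing the returned logger explicitly to each package keeps tests isolated, since no test can change logging for another through the global default.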