components/linux: add ConfigureStaticRoutes + CheckRouteOverlap#151
components/linux: add ConfigureStaticRoutes + CheckRouteOverlap#151chokevin wants to merge 12 commits intoAzure:mainfrom
Conversation
Add a new component that installs one or more static IPv4 routes via a systemd oneshot unit that runs 'ip route replace' before kubelet starts. Motivation: some Azure GPU SKUs (e.g. Standard_ND96isr_H200_v5) have an InfiniBand fabric that installs connected /16 routes for 172.16.0.0/16 on every ib interface. When the destination AKS cluster is also addressed from 172.16.0.0/16, the kernel routes traffic to cluster-internal services (konnectivity, spegel) out an IB interface instead of eth0, breaking cross-node control-plane traffic. With this component, the operator can install more-specific /24 routes for the AKS subnets via eth0; /24 wins longest-prefix-match against the IB /16 without disturbing peer-to-peer IB. Design: - Proto: ConfigureStaticRoutes with repeated StaticRoute (destination CIDR, optional gateway, optional dev, optional metric). - Handler writes /etc/aks-flex-node/static-routes.sh and installs static-routes.service (Before=kubelet.service, After=network- online.target). Script is idempotent (ip route replace). - When Gateway is empty the script resolves the default gateway on <dev> (or eth0) at boot time — works across subnets without operator knowing the .1 of each. - Inputs are validated (CIDR / IP / interface name allowlist) to guard against shell injection when the spec is untrusted. Tests cover empty spec no-op, explicit vs auto-resolved gateway, metric, determinism, and rejection of invalid inputs.
There was a problem hiding this comment.
Pull request overview
Adds a new Linux component to install operator-specified static IPv4 routes via a systemd oneshot that runs before kubelet, intended to override provider-installed routes (e.g., IB-connected /16s) that can hijack cluster traffic.
Changes:
- Introduces
ConfigureStaticRoutesproto/API and registers the new action. - Implements a handler that renders
/etc/aks-flex-node/static-routes.shand installs/enablesstatic-routes.service. - Adds unit and tests for script rendering, validation, and determinism.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| components/linux/v20260301/exports.go | Registers the new ConfigureStaticRoutes action. |
| components/linux/v20260301/configure_static_routes.go | Implements script rendering, validation, and systemd unit installation/enablement. |
| components/linux/v20260301/configure_static_routes_test.go | Adds tests for script output/validation/determinism and iface-name sanitization. |
| components/linux/v20260301/assets/static-routes.service | Adds the oneshot systemd unit that runs the generated script before kubelet. |
| components/linux/action.proto | Adds ConfigureStaticRoutes + StaticRoute schema and docs. |
| components/linux/action.pb.go | Regenerated protobuf output for the new messages. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…gateway, IPv4-only - writeScriptIfChanged: byte-compare before writing; plumb change through to EnsureUnitRunning restart flag so a script-only change reruns the oneshot (RemainAfterExit=yes would otherwise skip the rerun). - resolve_default_gw: retry 30x/1s, then exit 1 rather than silently skipping the route (network-online.target doesn't guarantee DHCP has installed the default route on multi-NIC H200 SKUs). - Reject IPv6 destinations/gateways at render time; proto comments now say 'must be IPv4' explicitly and 'dev defaults to eth0'. - Tests: IPv6 rejection cases, fail-hard script shape, writeScriptIfChanged unit test (creates/noops-same/updates-different, gated by t.TempDir).
|
Addressed rubber-duck critique in
Tests added for all three (IPv6 rejection, fail-hard script shape, |
- When dev is empty, resolve the outbound interface from the IPv4 default route at boot time (works with predictable names like ens3/enp0s6), not a hardcoded eth0. New resolve_default_dev bash helper, 30x1s retry. - Use 'ip -4 route replace' to make the IPv4 intent explicit at the command level, matching the Is4() validation in Go and the proto docs. - Proto: dev comment now documents the dynamic-resolution behavior.
|
Addressed Copilot review in Fixed in this commit:
Already addressed in the prior commit
Not changing:
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
FYI on the red E2E check — it's not a test failure from this PR's code. The All other checks are green: Build, Lint, Test (unit), Code Quality, Security Scan, gosec, Dependency Review. Happy to re-home the branch inside Azure/AKSFlexNode if that's the preferred path for getting the E2E job to run. |
Pairs with ConfigureStaticRoutes: detects (and optionally fails-fast on)
the situation that motivated static routes in the first place — a node
where an expected cluster CIDR routes out a non-default interface
(classic case: H200 IB driver shadowing a customer VNet CIDR with a
connected /16 on ib0).
- New action + proto messages (CheckRouteOverlap, spec with
expected_cidrs and Mode {WARN, STRICT}).
- Systemd oneshot ordered After=static-routes.service Before=kubelet,
so any mitigation has already been applied when the check runs.
- WARN mode logs and writes /run/aks-flex-node/route-overlap.detected
but lets kubelet start. STRICT mode exit 1's; with
RequiredBy=kubelet.service the node will not become Ready.
- For each expected_cidrs entry, the script asks the kernel directly via
'ip -4 route get <probe>' which interface that address would go out,
and compares it to the IPv4 default route's interface. Any mismatch
is logged with a concrete remediation hint pointing at staticRoutes.
|
Added What it does: oneshot ordered Modes:
Why pair them in one PR: Test/lint clean. |
|
Follow-up update: ConfigureStaticRoutes now has an explicit opt-in gate in the spec.
This makes static-route injection deliberate and prevents accidental automatic application. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Addressed latest Copilot findings in commit 06abb17: added nil-spec validation in CheckRouteOverlap, moved RequiredBy to [Install] in check-route-overlap.service, changed ConfigureStaticRoutes empty-routes behavior to stop/disable static-routes.service, and added tests for new guard paths. |
|
Follow-up cleanup in 79973ef: simplified both generated route scripts to data-driven loop forms (heredoc entries + single processing loop) so we avoid per-route bash block expansion. Behavior is unchanged; tests updated and passing. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -0,0 +1,16 @@ | |||
| [Unit] | |||
| Description=AKSFlexNode IPv4 route overlap pre-flight check | |||
| DefaultDependencies=no | |||
There was a problem hiding this comment.
DefaultDependencies=no is unusual here and can drop important implicit ordering/dependencies (e.g., local-fs/journald/basic targets). Since the unit already has explicit After=network-online.target and Before=kubelet.service, consider removing DefaultDependencies=no unless there's a demonstrated boot-order issue that requires it (or add the missing explicit deps that DefaultDependencies would otherwise provide).
| DefaultDependencies=no |
| if spec == nil { | ||
| return fmt.Errorf("spec is required") | ||
| } | ||
| return nil |
There was a problem hiding this comment.
validateCheckRouteOverlapSpec only checks for non-nil spec. Since protobuf enums can carry unknown numeric values, it would be safer to validate spec.mode is one of {MODE_UNSPECIFIED, WARN, STRICT} (or explicitly handle unknown values) so a bad value doesn't silently fall back to WARN behavior in renderCheckRouteOverlapScript.
| return nil | |
| switch int32(spec.GetMode()) { | |
| case 0, 1, 2: | |
| return nil | |
| default: | |
| return fmt.Errorf("invalid mode %d: must be one of MODE_UNSPECIFIED(0), WARN(1), or STRICT(2)", spec.GetMode()) | |
| } |
| } | ||
|
|
||
| var b strings.Builder | ||
| fmt.Fprintf(&b, "#!/bin/bash\n") |
There was a problem hiding this comment.
prefer to use embed bash script instead of concating string in go code
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| DEFAULT_DEV=$(ip -4 route show default 2>/dev/null | awk '/^default via/ {for (i=1;i<=NF;i++) if ($i=="dev") {print $(i+1); exit}}') | ||
| if [ -z "$DEFAULT_DEV" ]; then | ||
| echo "check-route-overlap: no IPv4 default route; cannot determine outbound interface" >&2 | ||
| echo "no-default-route" > /run/aks-flex-node/route-overlap.detected | ||
| exit {{ .FailExit }} | ||
| fi |
There was a problem hiding this comment.
DEFAULT_DEV extraction only matches routes that contain "default via". On systems where the default route is represented as "default dev " (no gateway), this will incorrectly treat the default route as missing and can fail the unit (or block kubelet in STRICT). Recommend matching any "^default" line and then extracting the "dev" token.
| // Probe with first usable address (network address + 1). Works for | ||
| // any prefix shorter than /32; for /32 we just probe the address. | ||
| probe := prefix.Addr() | ||
| if prefix.Bits() < 32 { |
There was a problem hiding this comment.
Probe selection can escape the CIDR when the input prefix isn’t network-aligned. netip.ParsePrefix("10.0.0.255/24") is valid, but prefix.Addr().Next() becomes 10.0.1.0 (outside /24), causing false negatives/positives for overlap detection. Consider basing the probe on prefix.Masked().Addr() (and/or verifying prefix.Contains(probe) after Next()).
| // Probe with first usable address (network address + 1). Works for | |
| // any prefix shorter than /32; for /32 we just probe the address. | |
| probe := prefix.Addr() | |
| if prefix.Bits() < 32 { | |
| maskedPrefix := prefix.Masked() | |
| // Probe with first usable address (network address + 1). Works for | |
| // any prefix shorter than /32; for /32 we just probe the network | |
| // address itself. Use the masked/network form of the prefix so the | |
| // probe always stays within the CIDR even when the input is not | |
| // network-aligned. | |
| probe := maskedPrefix.Addr() | |
| if maskedPrefix.Bits() < 32 { |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| cat >&2 <<'EOF' | ||
| Action: configure spec.staticRoutes on the NodeClass with more-specific | ||
| routes for the affected CIDRs, or rebuild the cluster on a non-overlapping | ||
| VNet CIDR. See AKSFlexNode/components/linux/v20260301/configure_static_routes.go. |
There was a problem hiding this comment.
The operator guidance in the overlap failure text references a repository source path (AKSFlexNode/components/.../configure_static_routes.go), which won’t exist on the node at runtime. Recommend replacing this with a user-consumable reference (e.g., a docs URL, KB link, or a brief inline summary of the required NodeClass fields) so the message remains actionable from journalctl output alone.
| VNet CIDR. See AKSFlexNode/components/linux/v20260301/configure_static_routes.go. | |
| VNet CIDR. For each affected CIDR, add a spec.staticRoutes entry that | |
| specifies the destination CIDR and the next-hop/default gateway so traffic | |
| uses the node's normal outbound interface. |
| ACTUAL=$(ip -4 route get "$PROBE" 2>/dev/null | awk '{for (i=1;i<=NF;i++) if ($i=="dev") {print $(i+1); exit}}') | ||
| if [ -z "$ACTUAL" ]; then ACTUAL="<no-route>"; fi | ||
| if [ "$ACTUAL" != "$DEFAULT_DEV" ]; then | ||
| msg="OVERLAP: expected CIDR $CIDR (probe $PROBE) routes via $ACTUAL, expected $DEFAULT_DEV" |
There was a problem hiding this comment.
The script labels missing routes as an "OVERLAP". When ip -4 route get yields no dev token, ACTUAL is set to <no-route> but the emitted message still starts with OVERLAP, which is inaccurate and can mislead debugging. Consider special-casing the <no-route> case to emit a different message (and/or different marker content) that clearly indicates there is no route rather than an interface mismatch.
| msg="OVERLAP: expected CIDR $CIDR (probe $PROBE) routes via $ACTUAL, expected $DEFAULT_DEV" | |
| if [ "$ACTUAL" = "<no-route>" ]; then | |
| msg="NO-ROUTE: expected CIDR $CIDR (probe $PROBE) has no IPv4 route; expected via $DEFAULT_DEV" | |
| else | |
| msg="OVERLAP: expected CIDR $CIDR (probe $PROBE) routes via $ACTUAL, expected $DEFAULT_DEV" | |
| fi |
Problem
Some Azure GPU SKUs (notably
Standard_ND96isr_H200_v5) ship images where the InfiniBand fabric driver installs connected/16routes for172.16.0.0/16on every IB interface (ib0..ib7). When a customer's cluster VNet also lives in172.16/16, every packet from a pod to the API server / kube-dns / cross-node pods hits the IB connected route (longest-prefix-match wins, and connected scope wins over gatewayed default), gets dumped on the IB fabric, and dies — IB only knows IB peer GUIDs, not arbitrary IPs.Symptom: kubelet looks healthy, the node is
Ready, but pods scheduled on it can't reach anything. Hours of debugging to discover the real cause is in the kernel route table.This PR adds two paired actions to AKSFlexNode that turn this into a loud, actionable boot-time error and give operators a way to fix it.
What's in this PR
1.
ConfigureStaticRoutes— the mitigationBootstrap-time systemd oneshot that installs more-specific IPv4 routes via the IPv4 default-route interface, ordered
Before=kubelet.serviceso the route table is correct on the first kubelet registration.For the H200 case: install
/24routes for the affected cluster CIDRs viaeth0. Longest-prefix-match means the/24s win for cluster traffic; the IB/16still works for IB-peer traffic onib0..ib7. No conflict.Notable design choices:
staticRoutes:ever get the unit installed. Zero impact on non-IB SKUs.ip -4 route replace: explicit IPv4 at the command level matches the proto's IPv4-only contract and theIs4()validation in Go.devis empty the script resolves the outbound interface from the IPv4 default route at boot (handles classiceth0and predictable names likeens3/enp0s6).exit 1s rather than silently skipping the route. WithRequiredBy=kubelet.service, this means kubelet won't start with broken routing — better than booting half-working.writeScriptIfChangedbyte-compares before writing, and the change signal is plumbed intoEnsureUnitRunning's restart flag. A route-only update reruns the oneshot even though the unit file hasn't changed (RemainAfterExit=yeswould otherwise mark the unit "active" and skip the rerun).prefix.Addr().Is4()+gwAddr.Is4(). The proto and impl agree.2.
CheckRouteOverlap— the detectionThe companion. Without it,
ConfigureStaticRoutesis a knob you only know to turn after spending hours debugging an unreachable API server.Boot-time oneshot ordered
After=network-online.target static-routes.serviceandBefore=kubelet.service RequiredBy=kubelet.service. For eachexpected_cidrsentry (typically pod CIDR + service CIDR + API server endpoint, populated by the controller), the script asks the kernel directly viaip -4 route get <probe>which interface that address would actually go out, and compares it to the IPv4 default route's interface. Any mismatch is an overlap.Two modes:
WARN: log the overlap, write/run/aks-flex-node/route-overlap.detected, let kubelet start. Use this when introducing detection on an existing cluster you don't want to break.STRICT: same logging +exit 1. WithRequiredBy=kubelet.service, kubelet does not start, the node never becomes Ready, and the cluster operator sees the failure immediately inkubectl get nodes. Recommended for production: a misrouted node is worse than a node that won't join.The error message is the documentation:
The check runs
After=static-routes.service, so ifConfigureStaticRoutesalready mitigated the overlap, the kernel's view at check time reflects that and the check passes.Why pair them in one PR
ConfigureStaticRoutesalone is a footgun in reverse — operators only learn they need it after pods mysteriously can't reach the API server. Together, the two actions turn the H200/VNet-overlap class of problem from "silent failure that takes hours to root-cause" into "boot-time error that names the bad CIDR and points at the fix."Consumer
Both actions are wired into the controller in a follow-up
aks-flexPR:ConfigureStaticRoutesis populated from a newspec.staticRoutesfield on the NodeClass.CheckRouteOverlapis populated automatically from the cluster's pod CIDR + service CIDR + API server, defaulting toWARN(configurable toSTRICTper NodeClass).Verification
go build ./...clean.go test ./components/linux/...clean (new unit tests cover render correctness, mode-dependent exit codes, IPv4 validation, IPv6 rejection, deterministic output,writeScriptIfChangedidempotency).golangci-lint run ./components/linux/v20260301/0 issues.Note on E2E check
The
E2E Tests (MSI + Token)check is failing because the workflow readssecrets.E2E_RESOURCE_GROUPetc., and GitHub doesn't expose repository secrets to PRs from forks. First failure line:common.sh:129: E2E_RESOURCE_GROUP: Set E2E_RESOURCE_GROUP in environment or .env. Happy to re-home the branch inside Azure/AKSFlexNode if a maintainer wants the E2E job to run.