Description
Issue: HasHighlyAvailableControlPlane incorrectly uses filtered instance groups
/kind bug
1. What kops version are you running?
kops version 1.33.0
2. What Kubernetes version are you running?
Kubernetes v1.33.0
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
On a cluster with multiple control plane nodes (HA setup), run:
kops update cluster <cluster-name> --instance-group nodes --yes
Where nodes is a worker instance group (not a control plane instance group).
5. What happened after the commands executed?
When updating a specific non-control-plane instance group using --instance-group or --instance-group-roles filters, the cluster-wide addons like:
- aws-load-balancer-controller
- node-termination-handler
- cluster-autoscaler
- aws-ebs-csi-driver
incorrectly have their replica count reduced from 2 to 1, even though the cluster still has multiple control plane nodes.
6. What did you expect to happen?
The replica count for cluster-wide controllers should remain at 2 (or the correct value based on the actual number of control plane nodes in the cluster), regardless of which instance group is being updated via filters.
7. Root Cause Analysis
The HasHighlyAvailableControlPlane() function in upup/pkg/fi/cloudup/template_functions.go uses tf.InstanceGroups, which is a filtered list based on --instance-group or --instance-group-roles flags.
When updating only worker nodes, tf.InstanceGroups contains no control plane nodes, causing the function to incorrectly return false even though the cluster has multiple control plane nodes.
This function is used by ControlPlaneControllerReplicas() which determines the replica count for various controllers that should run at cluster level, not at the filtered instance group level.
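The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the kops source: the types InstanceGroup and Role and the helpers filterByName, hasHAControlPlane and replicas are hypothetical stand-ins, and the rule "more than one control plane instance group means HA" is a simplifying assumption about what the real function checks.

```go
package main

import "fmt"

// Role and InstanceGroup are simplified stand-ins, not the kops API types.
type Role string

const (
	RoleControlPlane Role = "ControlPlane"
	RoleNode         Role = "Node"
)

type InstanceGroup struct {
	Name string
	Role Role
}

// filterByName mimics the effect of an --instance-group filter on the list.
func filterByName(all []InstanceGroup, name string) []InstanceGroup {
	var out []InstanceGroup
	for _, ig := range all {
		if ig.Name == name {
			out = append(out, ig)
		}
	}
	return out
}

// hasHAControlPlane assumes HA means more than one control plane group
// is present in the list it is given.
func hasHAControlPlane(igs []InstanceGroup) bool {
	count := 0
	for _, ig := range igs {
		if ig.Role == RoleControlPlane {
			count++
		}
	}
	return count > 1
}

// replicas returns 2 for an HA control plane and 1 otherwise,
// matching the replica counts reported in this issue.
func replicas(igs []InstanceGroup) int {
	if hasHAControlPlane(igs) {
		return 2
	}
	return 1
}

func main() {
	all := []InstanceGroup{
		{Name: "control-plane-a", Role: RoleControlPlane},
		{Name: "control-plane-b", Role: RoleControlPlane},
		{Name: "control-plane-c", Role: RoleControlPlane},
		{Name: "nodes", Role: RoleNode},
	}
	filtered := filterByName(all, "nodes")

	fmt.Println(replicas(filtered)) // 1: the filtered list hides the control plane
	fmt.Println(replicas(all))      // 2: the full list reflects the real cluster
}
```

The cluster is identical in both calls; only the list passed to the check differs, which is why the result should not depend on the filter.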
Code Location:
- Bug: upup/pkg/fi/cloudup/template_functions.go:503
- Context: pkg/model/context.go:52-59 (definition of InstanceGroups vs AllInstanceGroups)
8. Proposed Solution
Change HasHighlyAvailableControlPlane() to use tf.AllInstanceGroups instead of tf.InstanceGroups.
This is appropriate because:
- HA status is a cluster-wide property, not specific to filtered instance groups
- Other cluster-wide operations already use AllInstanceGroups (e.g., IAM configuration on line 720)
- The comment in context.go explicitly states: "we sometimes need the full list for example when configuring cluster-wide IAM"
The fix includes:
- Code change to use AllInstanceGroups
- Comprehensive test coverage including a regression test for this specific scenario
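As an illustration of the shape of the change and of the regression check, here is a hedged sketch. The templateFuncs and instanceGroup types are hypothetical stand-ins for the real kops types; only the field names InstanceGroups and AllInstanceGroups and the method name HasHighlyAvailableControlPlane come from this issue, and the HA rule (more than one control plane group) is again a simplifying assumption.

```go
package main

import "fmt"

// instanceGroup is a simplified stand-in, not the kops API type.
type instanceGroup struct {
	Name           string
	IsControlPlane bool
}

// templateFuncs is a hypothetical holder for the two lists named in the issue.
type templateFuncs struct {
	InstanceGroups    []instanceGroup // filtered by --instance-group / --instance-group-roles
	AllInstanceGroups []instanceGroup // every instance group in the cluster
}

// HasHighlyAvailableControlPlane reads the unfiltered list, so the answer
// does not change with the filter flags.
func (tf *templateFuncs) HasHighlyAvailableControlPlane() bool {
	count := 0
	for _, ig := range tf.AllInstanceGroups {
		if ig.IsControlPlane {
			count++
		}
	}
	return count > 1
}

func main() {
	all := []instanceGroup{
		{Name: "control-plane-a", IsControlPlane: true},
		{Name: "control-plane-b", IsControlPlane: true},
		{Name: "control-plane-c", IsControlPlane: true},
		{Name: "nodes"},
	}

	// Regression scenario: only the worker group passes the filter.
	tf := &templateFuncs{
		InstanceGroups:    all[3:],
		AllInstanceGroups: all,
	}
	fmt.Println(tf.HasHighlyAvailableControlPlane()) // true
}
```

A regression test along these lines would fail against the current behavior, because the filtered list in this scenario contains no control plane groups.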
9. Impact
This bug can cause:
- Reduced availability of critical controllers during instance group updates
- Unexpected downscaling of cluster-wide services
- Potential service disruptions if the single remaining replica experiences issues
The issue only manifests when using --instance-group or --instance-group-roles filters with kops update cluster.