
Conversation

@nilo19 (Contributor) commented Nov 12, 2025

What type of PR is this?

/kind design

What this PR does / why we need it:

Add design doc for lb admin state control.

Which issue(s) this PR fixes:

Fixes #
Related: #9633

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the release-note-none (Denotes a PR that doesn't merit a release note.) and kind/design (Categorizes issue or PR as related to design.) labels on Nov 12, 2025.
netlify bot commented Nov 12, 2025

Deploy Preview for kubernetes-sigs-cloud-provide-azure ready!

Latest commit: 787b7e3
Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-cloud-provide-azure/deploys/69147799ddb1c60008987dc2
Deploy Preview: https://deploy-preview-9634--kubernetes-sigs-cloud-provide-azure.netlify.app

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) on Nov 12, 2025.
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nilo19

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.) and size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) labels on Nov 12, 2025.

### `PreemptScheduled` Event

1. The events informer surfaces a `PreemptScheduled` warning for node `N`.
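For orientation, here is a minimal sketch of what such an event informer could look like with client-go; only the `PreemptScheduled` reason comes from the design text, while the queue and function names are illustrative:

```go
// Sketch: enqueue node names whenever a PreemptScheduled warning event is
// observed for a Node object. Assumes a shared informer factory and a
// rate-limited workqueue owned by a hypothetical admin-state controller.
package adminstate

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

func registerPreemptEventHandler(factory informers.SharedInformerFactory, queue workqueue.RateLimitingInterface) {
	factory.Core().V1().Events().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			ev, ok := obj.(*corev1.Event)
			if !ok {
				return
			}
			// Only react to the Spot eviction warning emitted for Node objects.
			if ev.Type == corev1.EventTypeWarning &&
				ev.Reason == "PreemptScheduled" &&
				ev.InvolvedObject.Kind == "Node" {
				queue.Add(ev.InvolvedObject.Name)
			}
		},
	})
}
```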


Is the event read from IMDS or API?

Contributor Author


The Kubernetes API.


Are you talking about Kubernetes node events? Is there an existing mechanism to pass preempt events from IMDS to node events? https://learn.microsoft.com/en-us/azure/virtual-machines/windows/scheduled-events
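For reference, polling IMDS Scheduled Events directly would look roughly like this; the endpoint, the `Metadata: true` header, and the `Preempt` event type are from the linked Azure documentation, while the struct and function names are illustrative:

```go
// Sketch: poll the IMDS Scheduled Events endpoint for Preempt notices.
package adminstate

import (
	"encoding/json"
	"net/http"
	"time"
)

type scheduledEvent struct {
	EventId   string   `json:"EventId"`
	EventType string   `json:"EventType"` // e.g. "Preempt" for Spot evictions
	Resources []string `json:"Resources"`
	NotBefore string   `json:"NotBefore"`
}

type scheduledEventsDoc struct {
	DocumentIncarnation int              `json:"DocumentIncarnation"`
	Events              []scheduledEvent `json:"Events"`
}

func pollScheduledEvents() (*scheduledEventsDoc, error) {
	req, err := http.NewRequest(http.MethodGet,
		"http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01", nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Metadata", "true") // IMDS requires this header

	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var doc scheduledEventsDoc
	if err := json.NewDecoder(resp.Body).Decode(&doc); err != nil {
		return nil, err
	}
	return &doc, nil
}
```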

Contributor Author


Why would we use IMDS instead of watching Kubernetes events? cloud-node-manager is a lightweight component for node initialization, designed to replace the node controller in CCM; we probably don't want to put many features into it.

### High-Level Flow

1. Kubernetes adds the `node.kubernetes.io/out-of-service` taint to a node or emits a `PreemptScheduled` event indicating an imminent Spot eviction.
2. For Spot events, cloud-provider-azure patches the node with `cloudprovider.azure.microsoft.com/draining=spot-eviction` (one-time) so the signal persists beyond controller restarts; manual application of this taint is treated the same way.
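A minimal sketch of the one-time patch in step 2, assuming a client-go clientset; the label key and value come from the document, while the helper name and use of a strategic merge patch are illustrative:

```go
// Sketch: persist the Spot-eviction signal as a node label so it survives
// controller restarts.
package adminstate

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

func markNodeDrainingForSpotEviction(ctx context.Context, kubeClient kubernetes.Interface, nodeName string) error {
	// Label key/value from the design doc; applied once per eviction signal.
	patch := []byte(`{"metadata":{"labels":{"cloudprovider.azure.microsoft.com/draining":"spot-eviction"}}}`)
	_, err := kubeClient.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```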


"cloudprovider.azure.microsoft.com/draining=spot-eviction"

Is it a label or an annotation? Why not ask the user to patch the node with this directly, instead of adding a taint or emitting an event?

Contributor Author


A label. The user can patch this manually. But if we detect a Spot eviction event, we would patch the node proactively and trigger the logic. We could trigger the logic directly without patching the node, but that is not robust: the controller manager could restart in the middle, and the signal would be lost.


### High-Level Flow

1. Kubernetes adds the `node.kubernetes.io/out-of-service` taint to a node or emits a `PreemptScheduled` event indicating an imminent Spot eviction.


it looks like "node.kubernetes.io/out-of-service" is a well-known taint
https://kubernetes.io/docs/reference/labels-annotations-taints/#node-kubernetes-io-out-of-service

However, it's not added by Kubernetes native controllers but manually by the user. What effect do you propose here?


Do you expect the user/caller to drain the pods on the node right after tainting it? Since this requires calling the SLB control plane to set AdminState=Down, it might fail or take a long time. I think it's better to have the controller signal back on the node that AdminState=Down has been set, so that the caller can proceed to drain the node.

Contributor Author


The user can manually taint the node. Besides, we taint the node when it is shut down and remove the taint when the node comes back.
I am a bit worried about a node that quickly flips between shutdown and ready states; in that case we may generate a lot of AdminState=Down requests.
Let me reconsider the logic here. How do you think we can prevent this case? @robbiezhang

Also, for the callback signal you mentioned, we can patch the node again after setting AdminState to Down, and clear that patch after setting it back to None. Let me see how to refine the document.


- Introduce an `AdminStateManager` interface responsible for:
  - Translating a Kubernetes node into all relevant backend pool IDs (leveraging existing helpers such as `reconcileBackendPoolHosts`, `getBackendPoolIDsForService`, and the multiple-SLB bookkeeping).
  - Calling the Azure SDK (`LoadBalancerBackendAddressPoolsClient`) to update the specific backend address with `AdminState`.
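One possible Go shape for this interface; apart from the `AdminStateManager` name and the SDK client mentioned above, every identifier below is illustrative:

```go
// Sketch of the AdminStateManager interface described above.
package adminstate

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// BackendPoolTarget identifies one backend address within one backend pool.
type BackendPoolTarget struct {
	LoadBalancerName string
	BackendPoolID    string
	AddressName      string
}

type AdminStateManager interface {
	// BackendPoolTargets resolves a node into every backend pool entry that
	// references it, reusing the existing pool-mapping helpers.
	BackendPoolTargets(ctx context.Context, node *corev1.Node) ([]BackendPoolTarget, error)

	// SetAdminState updates a single backend address via the
	// LoadBalancerBackendAddressPoolsClient with the given admin state
	// ("Down" or "None").
	SetAdminState(ctx context.Context, target BackendPoolTarget, state string) error
}
```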


What's the throttling limit for this API? Does it allow concurrent updates on different backend addresses? Consider the scenario where an upgrader taints multiple nodes at the same time: what latency do we expect for setting the AdminState on all of them? Are the updates sequential, parallel, or batched?

1. Fetch the node object from the informer cache and compute desired admin state (`Down` when tainted, `None` when cleared).
2. Resolve backend entries using existing pool-mapping helpers.
3. Issue Azure SDK calls immediately (within the same event) so the time between taint observation and traffic cutover is bounded only by ARM latency.
4. Record success/failure metrics and retry via the queue if needed.
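A sketch of one worker pass over steps 1-4, assuming the hypothetical `AdminStateManager` sketched earlier plus a node lister and a rate-limited workqueue that drives retries; all names are illustrative:

```go
// Sketch: one worker iteration implementing the numbered steps above.
package adminstate

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	listerscorev1 "k8s.io/client-go/listers/core/v1"
)

// adminStateController wraps the hypothetical AdminStateManager; the struct
// and field names are not from the design doc.
type adminStateController struct {
	nodeLister listerscorev1.NodeLister
	manager    AdminStateManager
}

func (c *adminStateController) sync(ctx context.Context, nodeName string) error {
	// Step 1: read the node from the informer cache and compute the desired state.
	node, err := c.nodeLister.Get(nodeName)
	if err != nil {
		return err // the caller requeues on error
	}
	desired := "None"
	if nodeIsDraining(node) {
		desired = "Down"
	}

	// Step 2: resolve backend entries via the pool-mapping helpers.
	targets, err := c.manager.BackendPoolTargets(ctx, node)
	if err != nil {
		return err
	}

	// Step 3: issue the SDK calls inline so cutover time is bounded by ARM latency.
	for _, t := range targets {
		if err := c.manager.SetAdminState(ctx, t, desired); err != nil {
			// Step 4: record a failure metric / emit an event here, then let
			// the rate-limited workqueue retry.
			return err
		}
	}
	return nil
}

// nodeIsDraining reports whether the out-of-service taint or the draining
// label from the design doc is present on the node.
func nodeIsDraining(node *corev1.Node) bool {
	if node.Labels["cloudprovider.azure.microsoft.com/draining"] == "spot-eviction" {
		return true
	}
	for _, t := range node.Spec.Taints {
		if t.Key == "node.kubernetes.io/out-of-service" {
			return true
		}
	}
	return false
}
```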


The controller should emit an event on failure so that we/customers can observe it.

The controller should also expose latency metrics so that we can evaluate the effectiveness. This feature is useless if the latency is longer than 10s.

Contributor Author


Definitely.
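For illustration, the observability hooks discussed above could look roughly like this; the metric and event names are made up, and registration with the controller's metrics registry is omitted:

```go
// Sketch: latency metric and failure event for the admin-state update path.
package adminstate

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

var adminStateUpdateLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "azure_lb_admin_state_update_duration_seconds",
		Help:    "Time from taint/event observation to the AdminState update completing.",
		Buckets: prometheus.ExponentialBuckets(0.5, 2, 8), // 0.5s .. 64s
	},
	[]string{"result"},
)

func recordAdminStateUpdate(recorder record.EventRecorder, node *corev1.Node, start time.Time, err error) {
	result := "success"
	if err != nil {
		result = "failure"
		// Emit a warning event on the node so failures are visible to customers.
		recorder.Eventf(node, corev1.EventTypeWarning, "AdminStateUpdateFailed",
			"failed to set load balancer AdminState: %v", err)
	}
	adminStateUpdateLatency.WithLabelValues(result).Observe(time.Since(start).Seconds())
}
```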

- Restore admin state to `None` when a node returns to service so steady-state behavior is unaffected.
- Avoid additional user-facing APIs, annotations, or CLI switches.

## Non-Goals
Contributor


It looks like a non-goal is optimizing LB backend pool membership when an underlying node VM goes offline (hard down)? I assume the existing Azure LB backend pool polling (LB regularly polls backend pool IP addresses) is sufficient to automatically evict downed VMs from the backend pool?

Contributor Author


Do you mean the LB health probe by "regularly polls"?

