
Conversation

@zhanggbj
Contributor

@zhanggbj zhanggbj commented Oct 28, 2025

What this PR does / why we need it:
Support Node Auto Placement and Node AF/AAF

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

zhanggbj and others added 11 commits October 22, 2025 14:19
- Bump VMOP including Node AF/AAF support
- Add NodeAutoPlacement Feature Gate

(cherry picked from commit 700c8ae)
Removes the extra cases for VMG creation, such that VMG is created for:
1. Multiple zones, multiple MDs with no failureDomain
2. Multiple zones, multiple MDs with failureDomain
3. Single zone, existing cluster with no failureDomain MDs

Signed-off-by: Sagar Muchhal <[email protected]>
- Updates VMOP API dependency

Misc VMG fixes
- Use namingStrategy to calculate VM names
- Use MachineDeployment names for VMG placement label
- Include all MachineDeployments to generate the node-pool -> zone mapping

Fixes VMG webhook validation error
- Adds cluster-name label to AF/AAF spec
- Re-adds zone topology key to anti-affinity spec

Signed-off-by: Sagar Muchhal <[email protected]>
…gs#71)

* Refine VMG controller when generating per-MD zone labels

- Skip legacy already-placed VMs which do not have placement info
- Skip VMs which do not have zone info

* Apply suggestions from code review

---------

Co-authored-by: Sagar Muchhal <[email protected]>
- Sync VSphereMachines during day-2 operations in VMG controller
- Only wait for all intended VSphereMachines during initial Cluster creation
- Use annotations in VMG for per-md-zone info

Signed-off-by: Gong Zhang <[email protected]>
- Add VMG reconciler unit test
- Bump VMOP due to API change
- Filter out VSphereMachine events except create/delete events

Signed-off-by: Gong Zhang <[email protected]>
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 28, 2025
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 28, 2025
@zhanggbj zhanggbj changed the title [WIP][DoNotReview]Support Node Auto Placement and Node AF/AAF [WIP][DoNotReview]✨ Support Node Auto Placement and Node AF/AAF Oct 28, 2025
@zhanggbj zhanggbj force-pushed the node_auto_placement branch 3 times, most recently from 1160ce9 to 1cd61f9 on October 28, 2025 10:31
@zhanggbj
Contributor Author

/test ?

@k8s-ci-robot
Contributor

@zhanggbj: The following commands are available to trigger required jobs:

/test pull-cluster-api-provider-vsphere-e2e-govmomi-blocking-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-conformance-ci-latest-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-conformance-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-upgrade-1-34-1-35-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-blocking-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-conformance-ci-latest-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-conformance-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-upgrade-1-34-1-35-main
/test pull-cluster-api-provider-vsphere-e2e-vcsim-govmomi-main
/test pull-cluster-api-provider-vsphere-e2e-vcsim-supervisor-main
/test pull-cluster-api-provider-vsphere-test-main
/test pull-cluster-api-provider-vsphere-verify-main

The following commands are available to trigger optional jobs:

/test pull-cluster-api-provider-vsphere-apidiff-main
/test pull-cluster-api-provider-vsphere-janitor-main

Use /test all to run the following jobs that were automatically triggered:

pull-cluster-api-provider-vsphere-apidiff-main
pull-cluster-api-provider-vsphere-e2e-govmomi-blocking-main
pull-cluster-api-provider-vsphere-e2e-supervisor-blocking-main
pull-cluster-api-provider-vsphere-e2e-vcsim-govmomi-main
pull-cluster-api-provider-vsphere-e2e-vcsim-supervisor-main
pull-cluster-api-provider-vsphere-test-main
pull-cluster-api-provider-vsphere-verify-main

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@zhanggbj
Contributor Author

/test pull-cluster-api-provider-vsphere-e2e-govmomi-blocking-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-conformance-ci-latest-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-conformance-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-main
/test pull-cluster-api-provider-vsphere-e2e-govmomi-upgrade-1-34-1-35-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-blocking-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-conformance-ci-latest-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-conformance-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-main
/test pull-cluster-api-provider-vsphere-e2e-supervisor-upgrade-1-34-1-35-main
/test pull-cluster-api-provider-vsphere-e2e-vcsim-govmomi-main
/test pull-cluster-api-provider-vsphere-e2e-vcsim-supervisor-main
/test pull-cluster-api-provider-vsphere-test-main
/test pull-cluster-api-provider-vsphere-verify-main
/test pull-cluster-api-provider-vsphere-apidiff-main
/test pull-cluster-api-provider-vsphere-janitor-main

@zhanggbj zhanggbj force-pushed the node_auto_placement branch 7 times, most recently from 22604bf to 7c7c1a7 on November 3, 2025 08:55
@zhanggbj zhanggbj changed the title [WIP][DoNotReview]✨ Support Node Auto Placement and Node AF/AAF ✨ Support Node Auto Placement and Node AF/AAF Nov 11, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 11, 2025
@zhanggbj
Contributor Author

/assign @fabriziopandini @sbueringer @DanielXiao @silvery1622

PR is ready for review. Thanks in advance!

@k8s-ci-robot
Contributor

@zhanggbj: GitHub didn't allow me to assign the following users: silvery1622.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @fabriziopandini @sbueringer @DanielXiao @silvery1622

PR is ready for review. Thanks in advance!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

return err
}

if feature.Gates.Enabled(feature.NamespaceScopedZones) && feature.Gates.Enabled(feature.NodeAutoPlacement) {
Contributor

Is a webhook validation needed to deny the mixed mode for creation?

Contributor Author

IMO this should be blocked by downstream. Mixed mode is not supported now but it's more like a short-term approach, I prefer not to add it in CAPV.
Leave this open for other's opinion.

Member

I'm okay with not adding additional validation here

(I think the validation here would mean we would need an additional validating webhook on the Cluster object)

return reconcile.Result{}, err
}
mdNames := []string{}
for _, md := range machineDeployments.Items {
Contributor

Could you consider caching both name and failuredomain info for MD, so the loop for MDs is not needed in vmopmachine.go#246?

Contributor Author

I didn't quite get it. These are two files, virtualmachinegroup_reconciler.go and vmopmachine.go?

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from fabriziopandini. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zhanggbj zhanggbj force-pushed the node_auto_placement branch 8 times, most recently from afd77d2 to 233cf64 on November 12, 2025 16:13
- Update to watch Cluster as primary
- Decouple functions
- Update to use an exact match when checking if a VM is a member of the VMG
- Update UT
- Refine godoc
- Miscellaneous

Signed-off-by: Gong Zhang <[email protected]>
@zhanggbj zhanggbj force-pushed the node_auto_placement branch from 233cf64 to e3078c9 on November 13, 2025 03:35
// VSphereMachineToCluster maps VSphereMachine events to Cluster reconcile requests.
// This handler only processes VSphereMachine objects for Day-2 operations when a VMG can be found, ensuring
// the VMG member list stays in sync with VSphereMachines. If no corresponding VMG is found, this is a no-op.
func (r *VirtualMachineGroupReconciler) VSphereMachineToCluster(ctx context.Context, a ctrlclient.Object) []reconcile.Request {
Contributor

I am not quite sure, but I have one question. The max concurrent reconcile number virtualMachineGroupConcurrency is 10 by default.

Say a VMG is supposed to have VM1, VM2, and VM3:
At some point VM1 triggers a reconcile of this VMG, and VM2 is created right after VM1, triggering another worker to reconcile the same VMG. Is there any problem if CreateOrPatch is called in both reconciles?

Contributor
@DanielXiao DanielXiao Nov 13, 2025

A question which may be picky; you don't have to solve it. Maybe the member calculation logic should be moved into the CreateOrPatch mutate function.

Member
@sbueringer sbueringer Nov 13, 2025

There's no problem here.

Controller-runtime deduplicates reconcile.Requests and one Request (aka namespace/name) is at most reconciled once at a time.

If a Reconcile for a given Request is already in progress and another event comes in, controller-runtime will just "enqueue" another Request that will be reconciled only after the current Reconcile is done.

Contributor

Controller-runtime will just hold another Request in the queue until the in-progress Request is done, right? Thanks for clarifying.

Contributor
@DanielXiao DanielXiao left a comment

A question which may be picky; you don't have to solve it.

- apiGroups:
  - vmoperator.vmware.com
  resources:
  - virtualmachinegroups/status
Member

Let's just add all verbs on the marker for virtualmachineservices/status so it will be grouped in the RBAC rule above

(not sure how much sense this permission even makes, nobody does a get status :), but let's align to what we do for the other vmop resources)

Contributor Author

Updated the virtualmachineservices/status marker with all verbs so it could be grouped.

Question: It seems CAPV has permissions it doesn't truly use. Should we keep the permissions minimal, matching only what the controller actually uses?
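For reference, a grouped kubebuilder marker along the lines discussed above might look like the sketch below (the exact resource list is an assumption based on this thread; controller-gen merges markers that share the same API group and verbs into a single RBAC rule):

// Sketch only: the resources listed here are assumptions based on the thread above.
// Using the same verbs lets controller-gen merge these into a single RBAC rule.
// +kubebuilder:rbac:groups=vmoperator.vmware.com,resources=virtualmachineservices;virtualmachineservices/status;virtualmachinegroups;virtualmachinegroups/status,verbs=get;list;watch;create;update;patch;delete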

Named("virtualmachinegroup").
Watches(
&vmoprv1.VirtualMachineGroup{},
handler.EnqueueRequestForOwner(mgr.GetScheme(), reconciler.Client.RESTMapper(), &clusterv1.Cluster{}),
Member

Suggested change
handler.EnqueueRequestForOwner(mgr.GetScheme(), reconciler.Client.RESTMapper(), &clusterv1.Cluster{}),
handler.EnqueueRequestForOwner(mgr.GetScheme(), reconciler.Client.RESTMapper(), &clusterv1.Cluster{}),
ctrlbldr.WithPredicates(predicates.ResourceIsChanged(mgr.GetScheme(), predicateLog)),

We have not been using this in CAPV up until now, but let's use it here to already improve this new reconciler (I have a task to follow-up for others soon'ish).

By default the reconciler will get resyncs from all informers, i.e. every 10m everything gets reconciled because of the Cluster resync, and additionally every 10m everything gets reconciled because of the VirtualMachineGroup resync.

In core CAPI we drop events from resyncs of secondary watches to avoid this issue

(it's not needed for VSphereMachine below as we already drop all updates)

Contributor Author

Good catch. This helps to reduce the resync noise.

Comment on lines 96 to 101
vmg := &vmoprv1.VirtualMachineGroup{}
err := r.Client.Get(ctx, apitypes.NamespacedName{Namespace: vSphereMachine.Namespace, Name: clusterName}, vmg)

if err != nil {
	return nil
}
Member

I think we should not have this kind of logic in a mapper.

Let's deal with this only in the controller.

It's much more obvious there and the cache here can be out of sync.

Let's then use vSphereMachine.Namespace and clusterName for the request below

(please also update the godoc of this func)

Contributor Author

Makes sense. Fixed it and also added a predicate in the watch for VSphereMachine to only handle VSphereMachines which have the MachineDeployment label.
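A minimal sketch of what the simplified mapper could look like after this change (the cluster-name label lookup and the package aliases are assumptions; only the namespace/name request construction follows the suggestion above):

// Sketch only: assumes the cluster name is read from the standard CAPI
// cluster-name label on the VSphereMachine. No VMG cache lookup happens here;
// the controller itself decides whether a VirtualMachineGroup exists.
func (r *VirtualMachineGroupReconciler) VSphereMachineToCluster(ctx context.Context, a ctrlclient.Object) []reconcile.Request {
	vSphereMachine, ok := a.(*vmwarev1.VSphereMachine)
	if !ok {
		return nil
	}
	clusterName, ok := vSphereMachine.Labels[clusterv1.ClusterNameLabel]
	if !ok || clusterName == "" {
		return nil
	}
	return []reconcile.Request{{
		NamespacedName: apitypes.NamespacedName{Namespace: vSphereMachine.Namespace, Name: clusterName},
	}}
}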

clusterv1.MachineDeploymentNameLabel: nodePool,
},
},
TopologyKey: corev1.LabelTopologyZone,
Member

Is there a constant in the vmoperator API that we should use instead? (same for the other two cases below)

Contributor Author

VMOP is using the corev1 label keys, see for example https://github.com/search?q=repo%3Avmware-tanzu%2Fvm-operator%20corev1.LabelTopologyZone&type=code. It seems their affinity design follows the k8s pod affinity design.
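Purely for illustration, the label keys in question expressed with standard Kubernetes types (the real vm-operator affinity fields are assumed to mirror this pod-affinity shape):

// Illustrative only: shows the constants discussed above with standard
// Kubernetes types; the actual vm-operator AffinitySpec fields may differ.
term := corev1.PodAffinityTerm{
	LabelSelector: &metav1.LabelSelector{
		MatchLabels: map[string]string{
			// "cluster.x-k8s.io/deployment-name"
			clusterv1.MachineDeploymentNameLabel: machineDeploymentName, // placeholder variable
		},
	},
	// "topology.kubernetes.io/zone"
	TopologyKey: corev1.LabelTopologyZone,
}
_ = term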

Member
@fabriziopandini fabriziopandini left a comment

Partial review only.

A few themes are emerging and it will be nice if they are addressed across the changeset:

  • we should add godoc keeping track of why we are doing things in a specific way, because this will help future us when doing maintenance
  • we should consistently use machineDeployment instead of nodePool and failureDomain instead of zone (I know there is tech debt, but we don't have to add on top of it; possibly improve it)
  • logs should align to guidelines

vmOperatorVM = typedModified
}

if affinityInfo != nil && affinityInfo.affinitySpec != nil {
Member

Let's try to document why we are doing stuff, because this is the background that will be most valuable for future maintenance, eg.

Suggested change
if affinityInfo != nil && affinityInfo.affinitySpec != nil {
// Set Affinity and Group info on a VM only on creation,
// because, changing those info after placement won't have any effect.
if affinityInfo != nil && affinityInfo.affinitySpec != nil {

I would also probably revert the if, to check "if not set" (set on creation) first

Contributor Author
@zhanggbj zhanggbj Nov 14, 2025

Added comments clarifying that affinity only impacts VM locations during day-1.

I would also probably revert the if, to check "if not set" (set on creation) first

Could you help clarify this one?
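One possible reading of the suggestion, sketched with an assumed field name (the exact VM Operator spec field is not confirmed in this thread): check "not set yet", i.e. the VM is being created, before applying affinity.

// Sketch only: Spec.Affinity is an assumed field name, not confirmed by this PR.
// Set Affinity and Group info on a VM only on creation, because changing them
// after placement won't have any effect.
if vmOperatorVM.Spec.Affinity == nil && affinityInfo != nil && affinityInfo.affinitySpec != nil {
	vmOperatorVM.Spec.Affinity = affinityInfo.affinitySpec
}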

Comment on lines 858 to 859
// However, with Node Auto Placement enabled, failureDomain is optional and CAPV no longer
// sets PVC annotations. PVC placement now follows the StorageClass behavior (Immediate or WaitForFirstConsumer).Í
Member

Suggested change
// However, with Node Auto Placement enabled, failureDomain is optional and CAPV no longer
// sets PVC annotations. PVC placement now follows the StorageClass behavior (Immediate or WaitForFirstConsumer).Í
// However, with Node Auto Placement enabled, failureDomain is optional and CAPV no longer
// sets PVC annotations. PVC placement now follows the StorageClass behavior (Immediate or WaitForFirstConsumer).

Wondering if we should make this more explicit and set zonal to false when the Node Auto Placement feature flag is enabled

Contributor Author

I prefer we keep it as it is today, since even when Node Auto Placement is enabled, we still need to set the PVC annotation for control plane VMs.

I've added more clarification in the comments.

type affinityInfo struct {
	affinitySpec  *vmoprv1.AffinitySpec
	vmGroupName   string
	failureDomain *string
Member

Are we sure we need a pointer here? (If I'm not wrong there is no difference between this being nil or empty)

Contributor Author

I think using failureDomain string works here.
CAPV is using failureDomain *string (https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/main/apis/v1beta1/vspheremachine_types.go#L117) but CAPI is using failureDomain string. How about we keep it consistent in CAPV? But I don't have a strong opinion on this.
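For illustration, the non-pointer variant would be a small change (fields copied from the diff above; treating an empty string as "not set" is an assumption that matches core CAPI):

// Sketch of the non-pointer variant; an empty failureDomain means "not set",
// matching how core CAPI models the field.
type affinityInfo struct {
	affinitySpec  *vmoprv1.AffinitySpec
	vmGroupName   string
	failureDomain string
}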


// affinityInfo is an internal struct to store VM affinity information.
type affinityInfo struct {
	affinitySpec *vmoprv1.AffinitySpec
Member

Are we sure we need a pointer here? Looking at usage, whenever affinityInfo is set, affinitySpec is also set.

Contributor Author

Oh right, whenever we set affinityInfo, we'll always set affinitySpec, so only checking affinityInfo != nil seems fine.

However, since AffinitySpec contains lots of rules and arrays of MachineDeployments, is it better to keep it as *vmoprv1.AffinitySpec for lower memory usage?

log := ctrl.LoggerFrom(ctx)

// Calculate current Machines of all MachineDeployments.
current, err := getCurrentVSphereMachines(ctx, r.Client, cluster.Namespace, cluster.Name)
Member

should this be

Suggested change
current, err := getCurrentVSphereMachines(ctx, r.Client, cluster.Namespace, cluster.Name)
currentVSphereMachines, err := getCurrentVSphereMachines(ctx, r.Client, cluster.Namespace, cluster.Name)

Member

Also wondering if we should drop current prefix, because there is no corresponding desiredVSphereMachines list (but no strong opinions on this one)

Contributor Author

Updated.

As we have expectedVSphereMachineCount below, I used currentXXX here.


// Calculate expected Machines of all MachineDeployments.
// CAPV retrieves placement decisions from the VirtualMachineGroup to guide
// day-2 VM placement. At least one VM is required for each MachineDeployment.
Member

At least one VM is required for each MachineDeployment.

I'm wondering if this assumption holds when using autoscaler from zero... 🤔

cc @sbueringer

Contributor Author

Since this check only happens when the VMG is not created yet, day-2 scale-from-zero won't go into this case.
In addition, for day-1 we'll block creating a Cluster with MD.replica == 0.

Contributor Author

For day-2, since we'll get the VMG successfully in line , we won't run into this line.

// Wait for all intended VSphereMachines corresponding to MachineDeployment to exist only during initial Cluster creation.
current := int32(len(current))
if current < expected {
	log.Info("current VSphereMachines do not match expected", "Expected:", expected,
Member

Should this message start with Waiting for...
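For example, the reworded log line could look like this sketch (key names are illustrative):

// Sketch of the reworded message; the keys mirror the original log call.
log.Info("Waiting for the current number of VSphereMachines to match the expected count",
	"expected", expected, "current", current)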

affInfo.failureDomain = ptr.To(zone)
}

// Fetch machine deployments without explicit failureDomain specified
Member

I'm wondering if we should skip computing AffinitySpec after the machine has been created (we are not using it for updates).

Contributor Author

AffinitySpec doesn't have any impact after the initial cluster creation, but to keep it consistent, I think we should keep computing it here. Especially if, in the future, AffinitySpec starts to take effect during day-2, we won't need to update the code here.

- Refine VMG controller watch
- Refine naming, logging, godoc
- Miscellaneous

Signed-off-by: Gong Zhang <[email protected]>