✨ Support Node Auto Placement and Node AF/AAF #3655
base: main
Conversation
- Bump VMOP including Node AF/AAF support
- Add NodeAutoPlacement feature gate
(cherry picked from commit 700c8ae)
(cherry picked from commit cfeb862)

Removes the extra cases for VMG creation, such that a VMG is created for:
1. Multiple zones, multiple MDs with no failureDomain
2. Multiple zones, multiple MDs with failureDomain
3. Single zone, existing cluster with no failureDomain MDs
Signed-off-by: Sagar Muchhal <[email protected]>

- Updates VMOP API dependency
- Misc VMG fixes:
  - Use namingStrategy to calculate VM names
  - Use MachineDeployment names for the VMG placement label
  - Include all MachineDeployments to generate the node-pool -> zone mapping
- Fixes VMG webhook validation error:
  - Adds cluster-name label to AF/AAF spec
  - Re-adds the zone topology key to the anti-affinity spec
Signed-off-by: Sagar Muchhal <[email protected]>
Signed-off-by: Sagar Muchhal <[email protected]>
Signed-off-by: Sagar Muchhal <[email protected]>

…gs#71)
* Refine VMG controller when generating per-MD zone labels
  - Skip legacy already-placed VMs which do not have placement info
  - Skip VMs which do not have zone info
* Apply suggestions from code review
Co-authored-by: Sagar Muchhal <[email protected]>

- Sync VSphereMachines during day-2 operations in the VMG controller
- Only wait for all intended VSphereMachines during initial Cluster creation
- Use annotations in the VMG for per-MD zone info
Signed-off-by: Gong Zhang <[email protected]>

- Add VMG reconciler unit tests
- Bump VMOP due to API change
- Filter out VSphereMachine events except create/delete events
Signed-off-by: Gong Zhang <[email protected]>
Signed-off-by: Gong Zhang <[email protected]>
Signed-off-by: Gong Zhang <[email protected]>
Force-pushed from 1160ce9 to 1cd61f9
/test ?
@zhanggbj: The following commands are available to trigger required jobs: […] The following commands are available to trigger optional jobs: […] Use […]. In response to this: […]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/test pull-cluster-api-provider-vsphere-e2e-govmomi-blocking-main
Force-pushed from 22604bf to 7c7c1a7
/assign @fabriziopandini @sbueringer @DanielXiao @silvery1622
PR is ready for review. Thanks in advance!
@zhanggbj: GitHub didn't allow me to assign the following users: silvery1622. Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this: […]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
    return err
    }

    if feature.Gates.Enabled(feature.NamespaceScopedZones) && feature.Gates.Enabled(feature.NodeAutoPlacement) {
Is a webhook validation needed to deny the mixed mode for creation?
IMO this should be blocked downstream. Mixed mode is not supported for now, but it's more of a short-term approach, so I'd prefer not to add it in CAPV.
I'll leave this open for others' opinions.
I'm okay with not adding additional validation here
(I think the validation here would mean we would need an additional validating webhook on the Cluster object)
    return reconcile.Result{}, err
    }
    mdNames := []string{}
    for _, md := range machineDeployments.Items {
Could you consider caching both the name and the failureDomain info for each MD, so the loop over MDs is not needed in vmopmachine.go#246?
I didn't quite get it. These are two files, virtualmachinegroup_reconciler.go and vmopmachine.go?
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: […]
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files: […]
Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from afd77d2 to 233cf64
- Update to watch Cluster as primary
- Decouple functions
- Update to an exact match when checking if a VM is a member of the VMG
- Update UT
- Refine godoc
- Miscellaneous
Signed-off-by: Gong Zhang <[email protected]>
Force-pushed from 233cf64 to e3078c9
    // VSphereMachineToCluster maps VSphereMachine events to Cluster reconcile requests.
    // This handler only processes VSphereMachine objects for Day-2 operations when VMG could be found, ensuring
    // VMG member list in sync with VSphereMachines. If no corresponding VMG is found, this is a no-op.
    func (r *VirtualMachineGroupReconciler) VSphereMachineToCluster(ctx context.Context, a ctrlclient.Object) []reconcile.Request {
I am not quite sure, but I have one question. The max concurrent reconcile number, virtualMachineGroupConcurrency, is 10 by default.
Say a VMG is supposed to have VM1, VM2 and VM3: VM1 causes this VMG to be reconciled, and VM2 is created right after VM1, triggering another worker to reconcile the same VMG. Is there any problem if CreateOrPatch is called in both reconciles?
A question which may be picky, you don't have to solve it: maybe the member calculation logic should be moved into the CreateOrPatch mutate function.
There's no problem here.
Controller-runtime deduplicates reconcile.Requests, and one Request (i.e. one namespace/name) is reconciled at most once at a time.
If a Reconcile for a given Request is already in progress and another event comes in, controller-runtime will just enqueue another Request, which will be reconciled only after the current Reconcile is done.
So controller-runtime will just hold another Request in the queue until the in-progress Request is done, right? Thanks for clarifying.
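To make the picky suggestion above concrete, here is a minimal sketch (not the PR's actual code) of computing the member list inside the CreateOrPatch mutate function; `controllerutil` is sigs.k8s.io/controller-runtime/pkg/controller/controllerutil, while `calculateVMGMembers` and `setVMGMembers` are hypothetical helpers, and the VirtualMachineGroup spec fields are deliberately not spelled out:

```go
// Sketch: recompute members inside the mutate function so every reconcile,
// including ones queued while another reconcile of the same VMG was running,
// patches against the object CreateOrPatch just read.
vmg := &vmoprv1.VirtualMachineGroup{
	ObjectMeta: metav1.ObjectMeta{
		Namespace: cluster.Namespace,
		Name:      cluster.Name,
	},
}
if _, err := controllerutil.CreateOrPatch(ctx, r.Client, vmg, func() error {
	members, err := calculateVMGMembers(ctx, r.Client, cluster) // hypothetical helper
	if err != nil {
		return err
	}
	setVMGMembers(vmg, members) // hypothetical helper that fills the VMG spec
	return nil
}); err != nil {
	return reconcile.Result{}, err
}
```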
DanielXiao left a comment:
A question which may be picky, you don't have to solve it.
config/rbac/role.yaml (outdated)
    - apiGroups:
      - vmoperator.vmware.com
      resources:
      - virtualmachinegroups/status
Let's just add all verbs on the marker for virtualmachineservices/status so it will be grouped with the RBAC rule above.
(Not sure how much sense this permission even makes, nobody does a get on status :), but let's align with what we do for the other vmop resources.)
Updated virtualmachineservices/status with all verbs on the marker so it can be grouped.
Question: it seems CAPV has permissions it doesn't actually use. Should we keep the permissions minimal, matching only what the controller really uses?
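For illustration, a kubebuilder RBAC marker along these lines (the exact verb set and resource grouping are assumptions, mirroring how the other vmoperator.vmware.com markers are written) would let controller-gen emit one grouped rule:

```go
// Assumed marker shape only; place next to the existing vmoperator.vmware.com markers.
// +kubebuilder:rbac:groups=vmoperator.vmware.com,resources=virtualmachineservices;virtualmachineservices/status;virtualmachinegroups;virtualmachinegroups/status,verbs=get;list;watch;create;update;patch;delete
```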
| Named("virtualmachinegroup"). | ||
| Watches( | ||
| &vmoprv1.VirtualMachineGroup{}, | ||
| handler.EnqueueRequestForOwner(mgr.GetScheme(), reconciler.Client.RESTMapper(), &clusterv1.Cluster{}), |
Suggested change:

    handler.EnqueueRequestForOwner(mgr.GetScheme(), reconciler.Client.RESTMapper(), &clusterv1.Cluster{}),
    ctrlbldr.WithPredicates(predicates.ResourceIsChanged(mgr.GetScheme(), predicateLog)),
We have not been using this in CAPV up until now, but let's use it here to already improve this new reconciler (I have a task to follow up for the others soon-ish).
By default the reconciler gets resyncs from all informers, i.e. every 10m everything gets reconciled because of the Cluster resync, and additionally every 10m everything gets reconciled because of the VirtualMachineGroup resync.
In core CAPI we drop events from resyncs of secondary watches to avoid this issue.
(It's not needed for VSphereMachine below as we already drop all updates.)
Good catch. This helps to reduce the resync noise.
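For context, a minimal sketch of how the suggested predicate slots into the watch setup; package paths, API versions, and the reconciler shape are assumptions and simplified compared to the real controller:

```go
package controllers

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/predicates"
	ctrl "sigs.k8s.io/controller-runtime"
	ctrlbldr "sigs.k8s.io/controller-runtime/pkg/builder"
	ctrlclient "sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/handler"

	vmoprv1 "github.com/vmware-tanzu/vm-operator/api/v1alpha3"
)

// VirtualMachineGroupReconciler is reduced here to the single field this sketch needs.
type VirtualMachineGroupReconciler struct {
	Client ctrlclient.Client
}

// Reconcile is stubbed out; only the watch wiring matters for this example.
func (r *VirtualMachineGroupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

func (r *VirtualMachineGroupReconciler) SetupWithManager(mgr ctrl.Manager, concurrency int) error {
	predicateLog := ctrl.Log.WithName("virtualmachinegroup")
	return ctrl.NewControllerManagedBy(mgr).
		// Cluster is the primary watched resource.
		For(&clusterv1.Cluster{}).
		Named("virtualmachinegroup").
		// Secondary watch on VirtualMachineGroup, mapped back to the owning Cluster.
		// ResourceIsChanged drops resync events, so the periodic informer resync of
		// VirtualMachineGroups does not re-trigger every Cluster on top of the
		// Cluster informer's own resync.
		Watches(
			&vmoprv1.VirtualMachineGroup{},
			handler.EnqueueRequestForOwner(mgr.GetScheme(), r.Client.RESTMapper(), &clusterv1.Cluster{}),
			ctrlbldr.WithPredicates(predicates.ResourceIsChanged(mgr.GetScheme(), predicateLog)),
		).
		WithOptions(controller.Options{MaxConcurrentReconciles: concurrency}).
		Complete(r)
}
```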
    vmg := &vmoprv1.VirtualMachineGroup{}
    err := r.Client.Get(ctx, apitypes.NamespacedName{Namespace: vSphereMachine.Namespace, Name: clusterName}, vmg)

    if err != nil {
        return nil
    }
I think we should not have this kind of logic in a mapper.
Let's deal with this only in the controller.
It's much more obvious there and the cache here can be out of sync.
Let's then use vSphereMachine.Namespace and clusterName for the request below
(please also update the godoc of this func)
Makes sense. Fixed it, and also added a predicate to the VSphereMachine watch so it only handles VSphereMachines that have the MachineDeployment label.
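A possible shape of the simplified mapper after this change, shown only as a sketch: it builds the request from vSphereMachine.Namespace and the cluster-name label without touching the cache. The `vmwarev1` alias (CAPV supervisor-mode API) and the exact label handling are assumptions:

```go
// VSphereMachineToCluster maps a VSphereMachine event to a reconcile request for
// the VirtualMachineGroup, which shares the owning Cluster's namespace and name.
// Whether that VMG exists is decided in the controller, not here, because the
// cache seen by a mapper can be out of sync.
func (r *VirtualMachineGroupReconciler) VSphereMachineToCluster(_ context.Context, o ctrlclient.Object) []reconcile.Request {
	vSphereMachine, ok := o.(*vmwarev1.VSphereMachine)
	if !ok {
		return nil
	}
	// A watch predicate already filters on the MachineDeployment label; this is
	// just a safety net for Day-2 worker machines.
	if _, ok := vSphereMachine.Labels[clusterv1.MachineDeploymentNameLabel]; !ok {
		return nil
	}
	clusterName, ok := vSphereMachine.Labels[clusterv1.ClusterNameLabel]
	if !ok || clusterName == "" {
		return nil
	}
	return []reconcile.Request{{
		NamespacedName: apitypes.NamespacedName{
			Namespace: vSphereMachine.Namespace,
			Name:      clusterName,
		},
	}}
}
```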
                clusterv1.MachineDeploymentNameLabel: nodePool,
            },
        },
        TopologyKey: corev1.LabelTopologyZone,
Is there a constant in the vmoperator API that we should use instead? (same for the other two cases below)
VMOP is using the corev1 label keys, e.g. https://github.com/search?q=repo%3Avmware-tanzu%2Fvm-operator%20corev1.LabelTopologyZone&type=code. It seems their affinity design follows the Kubernetes pod affinity design.
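As a side-by-side reference (not code from this PR), the term in the diff is analogous to a standard Kubernetes pod anti-affinity term built from the same label constants; `nodePool` is an assumed variable holding the MachineDeployment name:

```go
// corev1.LabelTopologyZone is "topology.kubernetes.io/zone" and
// clusterv1.MachineDeploymentNameLabel is "cluster.x-k8s.io/deployment-name",
// so the semantics match Kubernetes pod (anti-)affinity spreading across zones.
term := corev1.PodAffinityTerm{
	LabelSelector: &metav1.LabelSelector{
		MatchLabels: map[string]string{
			clusterv1.MachineDeploymentNameLabel: nodePool,
		},
	},
	TopologyKey: corev1.LabelTopologyZone,
}
_ = term
```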
Partial review only.
A few themes are emerging, and it would be nice if they were addressed across the changeset:
- we should add godoc keeping track of why we are doing things in a specific way, because this will help future us when doing maintenance
- we should consistently use machineDeployment instead of nodePool, and failureDomain instead of zone (I know there is tech debt, but we don't have to add on top of it; possibly improve it)
- logs should align to the guidelines
        vmOperatorVM = typedModified
    }

    if affinityInfo != nil && affinityInfo.affinitySpec != nil {
Let's try to document why we are doing stuff, because this is the background that will be most valuable for future maintenance, e.g.:

Suggested change:

    // Set Affinity and Group info on a VM only on creation,
    // because changing that info after placement won't have any effect.
    if affinityInfo != nil && affinityInfo.affinitySpec != nil {

I would also probably invert the if, to check "if not set" (set on creation) first.
Added comments to clarify that affinity only impacts VM locations during day-1.

> I would also probably invert the if, to check "if not set" (set on creation) first.

Could you help clarify this one?
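To make the "invert the if" idea concrete, one possible reading is sketched below; the check on the VM spec fields (GroupName/Affinity) is an assumption about the vm-operator VirtualMachine spec and not the actual PR code:

```go
// Set Affinity and Group info on a VM only on creation, because changing them
// after placement has no effect. Checking "not yet set" first is one way to
// invert the condition as suggested; field names here are assumed.
if vmOperatorVM.Spec.GroupName == "" && vmOperatorVM.Spec.Affinity == nil {
	if affinityInfo != nil && affinityInfo.affinitySpec != nil {
		vmOperatorVM.Spec.Affinity = affinityInfo.affinitySpec
		vmOperatorVM.Spec.GroupName = affinityInfo.vmGroupName
	}
}
```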
    // However, with Node Auto Placement enabled, failureDomain is optional and CAPV no longer
    // sets PVC annotations. PVC placement now follows the StorageClass behavior (Immediate or WaitForFirstConsumer).
Suggested change (drop the stray trailing character):

    // However, with Node Auto Placement enabled, failureDomain is optional and CAPV no longer
    // sets PVC annotations. PVC placement now follows the StorageClass behavior (Immediate or WaitForFirstConsumer).

Wondering if we should make this more explicit and set zonal to false when the Node Auto Placement feature flag is enabled.
I'd prefer we keep it as it is today: even when Node Auto Placement is enabled, we still need to set the PVC annotation for control plane VMs.
I've added more clarification in the comments.
    type affinityInfo struct {
        affinitySpec  *vmoprv1.AffinitySpec
        vmGroupName   string
        failureDomain *string
Are we sure we need a pointer here? (If I'm not wrong, there is no difference between this being nil or empty.)
I think using failureDomain string works here.
CAPV is using failureDomain *string (https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/main/apis/v1beta1/vspheremachine_types.go#L117) but CAPI is using failureDomain string. How about we keep it consistent within CAPV? I don't have a strong opinion on this, though.
    // affinityInfo is an internal struct to store VM affinity information.
    type affinityInfo struct {
        affinitySpec *vmoprv1.AffinitySpec
Are we sure we need a pointer here? Looking at usage, whenever affinityInfo is set, affinitySpec is also set.
Oh right, whenever we set affinityInfo we'll always set affinitySpec, so only checking affinityInfo != nil is fine.
However, since AffinitySpec contains lots of rules and arrays of MachineDeployments, is it better to keep it as *vmoprv1.AffinitySpec for lower memory usage?
    log := ctrl.LoggerFrom(ctx)

    // Calculate current Machines of all MachineDeployments.
    current, err := getCurrentVSphereMachines(ctx, r.Client, cluster.Namespace, cluster.Name)
Should this be:

    currentVSphereMachines, err := getCurrentVSphereMachines(ctx, r.Client, cluster.Namespace, cluster.Name)
Also wondering if we should drop the current prefix, because there is no corresponding desiredVSphereMachines list (but no strong opinion on this one).
Updated.
Since we have expectedVSphereMachineCount below, I used currentXXX here.
    // Calculate expected Machines of all MachineDeployments.
    // CAPV retrieves placement decisions from the VirtualMachineGroup to guide
    // day-2 VM placement. At least one VM is required for each MachineDeployment.
> At least one VM is required for each MachineDeployment.

I'm wondering if this assumption holds when using autoscaler from zero... 🤔
cc @sbueringer
Since this check only happens when the VMG has not been created yet, day-2 scale-from-zero won't go into this case.
In addition, for day-1 we'll block creating a Cluster with MD.replicas == 0.
For day-2, since we'll get the VMG successfully in line […], we won't run into this line.
    // Wait for all intended VSphereMachines corresponding to MachineDeployment to exist only during initial Cluster creation.
    current := int32(len(current))
    if current < expected {
        log.Info("current VSphereMachines do not match expected", "Expected:", expected,
Should this message start with "Waiting for ..."?
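A sketch of how that wait could read with the suggested wording and guideline-style log keys; `currentVSphereMachines` and `expectedVSphereMachineCount` follow the naming discussed above and are assumptions:

```go
// Wait for all intended VSphereMachines only during initial Cluster creation.
currentCount := int32(len(currentVSphereMachines))
if currentCount < expectedVSphereMachineCount {
	log.Info("Waiting for all expected VSphereMachines to be created before creating the VirtualMachineGroup",
		"expected", expectedVSphereMachineCount, "current", currentCount)
	// No requeue needed: create events for the remaining VSphereMachines will
	// trigger another reconcile through the VSphereMachine watch.
	return reconcile.Result{}, nil
}
```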
        affInfo.failureDomain = ptr.To(zone)
    }

    // Fetch machine deployments without explicit failureDomain specified
I'm wondering if we should skip computing AffinitySpec after the machine has been created (we're not using it for updates).
AffinitySpec has no impact after the initial cluster creation, but to keep things consistent I think we should keep computing it here. In particular, if AffinitySpec starts to take effect during day-2 in the future, we won't need to update the code here.
- Refine VMG controller watch
- Refine naming, logging, godoc
- Miscellaneous
Signed-off-by: Gong Zhang <[email protected]>
What this PR does / why we need it:
Support Node Auto Placement and Node AF/AAF
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #