-
Notifications
You must be signed in to change notification settings - Fork 255
feat: implement non-blocking CRD reconciliation with requeue support #693
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: a-hilaly The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
cdb5e66 to
ddba517
Compare
Implement non-blocking CRD reconciliation to improve controller throughput when managing large numbers of RGDs. The main idea is to free the workers to do other tasks when waiting for CRD establishment or deletion. Previously, the controller would block waiting for CRDs to become established, preventing other ResourceGraphDefinitions from being processed. This change introduces a requeue based approach where: - CRD creation happens immediately without blocking - controller checks CRD establishment status on each reconciliation - if CRD exists but isn't established, reconciler requeues after 500ms - other RGDs can be processed while waiting for CRD establishment this patch also gets rid of the CRDWraper abstraction layer as it's no longer needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is not enough, and probably near impossible to do proper testing for this feature at the integration level. Planning on adding unit tests to the controller.go where we properly test these things.
tjamet
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether it makes sense to tackle it together with KREP-004 it would possibly simplify how we deal with CRD management and allow simplify the reconciliation steps (dropping needs for requeues)
| func (r *ResourceGraphDefinitionReconciler) Reconcile(ctx context.Context, o *v1alpha1.ResourceGraphDefinition) (ctrl.Result, error) { | ||
| if !o.DeletionTimestamp.IsZero() { | ||
| if err := r.cleanupResourceGraphDefinition(ctx, o); err != nil { | ||
| if needRequeue, duration := requeue.Check(err); needRequeue { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about extracting this logic to a distinct function
That way we keep our kro internal semantics and reduce risk of "forgetting error check for RequeueAfter" for future changes
// keep Reconcile(ctxm req) (result, obj) as this is the interface for controller-runtime
// TODO: check opportunities to converge interfaces with controllerRuntime for them
// to move to our requeue logic
func (r *ResourceGraphDefinitionReconciler) Reconcile(ctx context.Context, o *v1alpha1.ResourceGraphDefinition) (error) {
err := r.reconcileRGD(ctx, o)
if needRequeue, duration := requeue.Check(err); needRequeue {
return ctrl.Result{RequeueAfter: duration}, nil
}
return ctrl.Result{}, nil
}
func (r *ResourceGraphDefinitionReconciler) reconcileRGD(ctx context.Context, o *v1alpha1.ResourceGraphDefinition) (ctrl.Result, error) {
// the current Reconcile
[...]
| if !completed { | ||
| log.V(1).Info("CRD deletion in progress, requeuing", "crd", crdName) | ||
| return requeue.NeededAfter(nil, crdDeletionRequeueDuration) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I sense this has some overlap with KREP-004 (#763 )
with KREP-004 we wouldn't need to manange requeues on our own. RGD reconciliation would be re-triggered as the CRD got deleted
| crd, err := r.clientSet.CRD().Get(ctx, name, metav1.GetOptions{}) | ||
| if err != nil { | ||
| if apierrors.IsNotFound(err) { | ||
| // CRD is gone, deletion complete | ||
| return true, nil | ||
| } | ||
| return false, err | ||
| } | ||
|
|
||
| // If CRD has no deletion timestamp, initiate deletion | ||
| if crd.DeletionTimestamp.IsZero() { | ||
| log.V(1).Info("initiating CRD deletion", "crd", name) | ||
| err = r.clientSet.CRD().Delete(ctx, name, metav1.DeleteOptions{}) | ||
| if err != nil && !apierrors.IsNotFound(err) { | ||
| return false, err | ||
| } | ||
| // Deletion initiated, but not complete yet | ||
| return false, nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the get and DeletionTimestamp check required?
IIRC err = r.clientSet.CRD().Delete(ctx, name, metav1.DeleteOptions{}) triggers the deletion of the CRD async already. Shall the CRD already have a DeletionTimestamp the API should also return immediately. Am I missing anything?
if so, we could eventually avoid some API calls and simplify the code with just this (we may lose some granularity on the log side though)
| crd, err := r.clientSet.CRD().Get(ctx, name, metav1.GetOptions{}) | |
| if err != nil { | |
| if apierrors.IsNotFound(err) { | |
| // CRD is gone, deletion complete | |
| return true, nil | |
| } | |
| return false, err | |
| } | |
| // If CRD has no deletion timestamp, initiate deletion | |
| if crd.DeletionTimestamp.IsZero() { | |
| log.V(1).Info("initiating CRD deletion", "crd", name) | |
| err = r.clientSet.CRD().Delete(ctx, name, metav1.DeleteOptions{}) | |
| if err != nil && !apierrors.IsNotFound(err) { | |
| return false, err | |
| } | |
| // Deletion initiated, but not complete yet | |
| return false, nil | |
| } | |
| log.V(1).Info("ensure CRD is marked for deletion", "crd", name) | |
| err = r.clientSet.CRD().Delete(ctx, name, metav1.DeleteOptions{}) | |
| if err != nil && !apierrors.IsNotFound(err) { | |
| return false, err | |
| } | |
| if apierrors.IsNotFound(err) { | |
| log.V(1).Info("CRD deletion completed", "crd", name) | |
| } | |
| // Deletion initiated, but not complete yet | |
| return false, nil |
Implement non blocking CRD reconciliation to improve controller throughput
when managing large amount of RGDs. The main idea is to free the workers to do
other tasks when we're waiting for the establishement.
Previously, the controller would block waiting for CRDs to become established,
preventing other ResourceGraphDefinitions from being processed. This change
introduces a requeue based approach where:
this patch also gets rid of the CRDWraper abstraction layer as it's no longer
needed.