Skip to content

Conversation

@a-hilaly
Copy link
Member

Implement non blocking CRD reconciliation to improve controller throughput
when managing large amount of RGDs. The main idea is to free the workers to do
other tasks when we're waiting for the establishement.

Previously, the controller would block waiting for CRDs to become established,
preventing other ResourceGraphDefinitions from being processed. This change
introduces a requeue based approach where:

  • CRD creation happens immediately without blocking
  • controller checks CRD establishment status on each reconciliation
  • if CRD exists but isn't established, reconciler requeues after 500ms
  • other RGDs can be processed while waiting for CRD establishment

this patch also gets rid of the CRDWraper abstraction layer as it's no longer
needed.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 21, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: a-hilaly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 21, 2025
@a-hilaly a-hilaly added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 21, 2025
@a-hilaly a-hilaly removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 21, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 26, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 10, 2025
@a-hilaly a-hilaly force-pushed the perf-crds branch 2 times, most recently from cdb5e66 to ddba517 Compare October 12, 2025 02:03
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 13, 2025
Implement non-blocking CRD reconciliation to improve controller throughput
when managing large numbers of RGDs. The main idea is to free the workers to
do other tasks when waiting for CRD establishment or deletion.

Previously, the controller would block waiting for CRDs to become established,
preventing other ResourceGraphDefinitions from being processed. This change
introduces a requeue based approach where:

- CRD creation happens immediately without blocking
- controller checks CRD establishment status on each reconciliation
- if CRD exists but isn't established, reconciler requeues after 500ms
- other RGDs can be processed while waiting for CRD establishment

this patch also gets rid of the CRDWraper abstraction layer as it's no longer
needed.
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 28, 2025
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is not enough, and probably near impossible to do proper testing for this feature at the integration level. Planning on adding unit tests to the controller.go where we properly test these things.

Copy link

@tjamet tjamet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether it makes sense to tackle it together with KREP-004 it would possibly simplify how we deal with CRD management and allow simplify the reconciliation steps (dropping needs for requeues)

func (r *ResourceGraphDefinitionReconciler) Reconcile(ctx context.Context, o *v1alpha1.ResourceGraphDefinition) (ctrl.Result, error) {
if !o.DeletionTimestamp.IsZero() {
if err := r.cleanupResourceGraphDefinition(ctx, o); err != nil {
if needRequeue, duration := requeue.Check(err); needRequeue {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about extracting this logic to a distinct function
That way we keep our kro internal semantics and reduce risk of "forgetting error check for RequeueAfter" for future changes


// keep Reconcile(ctxm req) (result, obj) as this is the interface for controller-runtime
// TODO: check opportunities to converge interfaces with controllerRuntime for them
// to move to our requeue logic
func (r *ResourceGraphDefinitionReconciler) Reconcile(ctx context.Context, o *v1alpha1.ResourceGraphDefinition) (error) {
    err := r.reconcileRGD(ctx, o)
    if needRequeue, duration := requeue.Check(err); needRequeue {
        return ctrl.Result{RequeueAfter: duration}, nil
    }
   return ctrl.Result{}, nil
}

func (r *ResourceGraphDefinitionReconciler) reconcileRGD(ctx context.Context, o *v1alpha1.ResourceGraphDefinition) (ctrl.Result, error) {
// the current Reconcile
[...]

Comment on lines +64 to +67
if !completed {
log.V(1).Info("CRD deletion in progress, requeuing", "crd", crdName)
return requeue.NeededAfter(nil, crdDeletionRequeueDuration)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sense this has some overlap with KREP-004 (#763 )
with KREP-004 we wouldn't need to manange requeues on our own. RGD reconciliation would be re-triggered as the CRD got deleted

Comment on lines +103 to +121
crd, err := r.clientSet.CRD().Get(ctx, name, metav1.GetOptions{})
if err != nil {
if apierrors.IsNotFound(err) {
// CRD is gone, deletion complete
return true, nil
}
return false, err
}

// If CRD has no deletion timestamp, initiate deletion
if crd.DeletionTimestamp.IsZero() {
log.V(1).Info("initiating CRD deletion", "crd", name)
err = r.clientSet.CRD().Delete(ctx, name, metav1.DeleteOptions{})
if err != nil && !apierrors.IsNotFound(err) {
return false, err
}
// Deletion initiated, but not complete yet
return false, nil
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the get and DeletionTimestamp check required?

IIRC err = r.clientSet.CRD().Delete(ctx, name, metav1.DeleteOptions{}) triggers the deletion of the CRD async already. Shall the CRD already have a DeletionTimestamp the API should also return immediately. Am I missing anything?

if so, we could eventually avoid some API calls and simplify the code with just this (we may lose some granularity on the log side though)

Suggested change
crd, err := r.clientSet.CRD().Get(ctx, name, metav1.GetOptions{})
if err != nil {
if apierrors.IsNotFound(err) {
// CRD is gone, deletion complete
return true, nil
}
return false, err
}
// If CRD has no deletion timestamp, initiate deletion
if crd.DeletionTimestamp.IsZero() {
log.V(1).Info("initiating CRD deletion", "crd", name)
err = r.clientSet.CRD().Delete(ctx, name, metav1.DeleteOptions{})
if err != nil && !apierrors.IsNotFound(err) {
return false, err
}
// Deletion initiated, but not complete yet
return false, nil
}
log.V(1).Info("ensure CRD is marked for deletion", "crd", name)
err = r.clientSet.CRD().Delete(ctx, name, metav1.DeleteOptions{})
if err != nil && !apierrors.IsNotFound(err) {
return false, err
}
if apierrors.IsNotFound(err) {
log.V(1).Info("CRD deletion completed", "crd", name)
}
// Deletion initiated, but not complete yet
return false, nil

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants