feat: implement non-blocking CRD reconciliation with requeue support #693

a-hilaly · 2025-09-21T08:22:08Z

Implement non-blocking CRD reconciliation to improve controller throughput
when managing a large number of RGDs. The main idea is to free the workers to do
other tasks while we're waiting for CRD establishment.

Previously, the controller would block waiting for CRDs to become established,
preventing other ResourceGraphDefinitions from being processed. This change
introduces a requeue-based approach where:

- CRD creation happens immediately without blocking
- the controller checks CRD establishment status on each reconciliation
- if the CRD exists but isn't established, the reconciler requeues after 500ms
- other RGDs can be processed while waiting for CRD establishment

This patch also gets rid of the CRDWrapper abstraction layer as it's no longer
needed.
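
The flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this patch: `crdStatus`, `crdClient`, `ensureCRD`, and `fakeClient` are hypothetical names, and the real reconciler talks to the Kubernetes API instead of an in-memory fake. Only the 500ms interval comes from the description.

```go
package main

import (
	"fmt"
	"time"
)

// crdRequeueInterval mirrors the 500ms delay mentioned in the description.
const crdRequeueInterval = 500 * time.Millisecond

// crdStatus is a hypothetical, simplified view of a CRD as seen by the reconciler.
type crdStatus struct {
	Exists      bool
	Established bool
}

// crdClient is a hypothetical stand-in for the real CRD client.
type crdClient interface {
	Status(name string) crdStatus
	Create(name string)
}

// ensureCRD never waits. It returns 0 once the CRD is established, or the
// delay after which the caller should requeue the RGD and check again.
func ensureCRD(c crdClient, name string) time.Duration {
	st := c.Status(name)
	if !st.Exists {
		c.Create(name) // create and return immediately
		return crdRequeueInterval
	}
	if !st.Established {
		return crdRequeueInterval
	}
	return 0
}

// fakeClient reports the CRD as established on the second status check
// after creation, to imitate an API server that needs some time.
type fakeClient struct {
	created bool
	checks  int
}

func (f *fakeClient) Status(string) crdStatus {
	if !f.created {
		return crdStatus{}
	}
	f.checks++
	return crdStatus{Exists: true, Established: f.checks >= 2}
}

func (f *fakeClient) Create(string) { f.created = true }

func main() {
	c := &fakeClient{}
	for i := 1; i <= 4; i++ {
		delay := ensureCRD(c, "widgets.example.com")
		fmt.Printf("reconcile %d: requeue after %v\n", i, delay)
	}
}
```

Between two calls to `ensureCRD` the worker is free, which is where the throughput gain is expected to come from.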

k8s-ci-robot · 2025-09-21T08:22:15Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: a-hilaly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [a-hilaly]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pkg/controller/resourcegraphdefinition/controller_reconcile.go

Implement non-blocking CRD reconciliation to improve controller throughput when managing large numbers of RGDs. The main idea is to free the workers to do other tasks when waiting for CRD establishment or deletion.

Previously, the controller would block waiting for CRDs to become established, preventing other ResourceGraphDefinitions from being processed. This change introduces a requeue-based approach where:

- CRD creation happens immediately without blocking
- the controller checks CRD establishment status on each reconciliation
- if the CRD exists but isn't established, the reconciler requeues after 500ms
- other RGDs can be processed while waiting for CRD establishment

This patch also gets rid of the CRDWrapper abstraction layer as it's no longer needed.

a-hilaly · 2025-10-28T07:57:45Z

test/integration/suites/core/crd_test.go

I know this is not enough, and it is probably near impossible to properly test this feature at the integration level. I'm planning to add unit tests for controller.go where we properly test these things.

tjamet

I wonder whether it makes sense to tackle this together with KREP-004. It would possibly simplify how we deal with CRD management and simplify the reconciliation steps (dropping the need for requeues).

tjamet · 2025-10-31T13:37:16Z

pkg/controller/resourcegraphdefinition/controller.go

 func (r *ResourceGraphDefinitionReconciler) Reconcile(ctx context.Context, o *v1alpha1.ResourceGraphDefinition) (ctrl.Result, error) {
 	if !o.DeletionTimestamp.IsZero() {
 		if err := r.cleanupResourceGraphDefinition(ctx, o); err != nil {
+			if needRequeue, duration := requeue.Check(err); needRequeue {


What about extracting this logic into a distinct function?
That way we keep our kro internal semantics and reduce the risk of "forgetting the error check for RequeueAfter" in future changes.

// Keep Reconcile(ctx, req) (result, error), as this is the interface for controller-runtime.
// TODO: check opportunities to converge interfaces with controller-runtime for them
// to move to our requeue logic.
func (r *ResourceGraphDefinitionReconciler) Reconcile(ctx context.Context, o *v1alpha1.ResourceGraphDefinition) (ctrl.Result, error) {
	err := r.reconcileRGD(ctx, o)
	if needRequeue, duration := requeue.Check(err); needRequeue {
		return ctrl.Result{RequeueAfter: duration}, nil
	}
	// Surface any non-requeue error to controller-runtime.
	return ctrl.Result{}, err
}

func (r *ResourceGraphDefinitionReconciler) reconcileRGD(ctx context.Context, o *v1alpha1.ResourceGraphDefinition) error {
	// the current Reconcile [...]
}
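
To make the idea concrete, here is a small self-contained sketch of how a requeue signal can travel as an error and be translated in one place. It does not reproduce kro's `requeue` package: `requeueAfterError`, `neededAfter`, `check`, and `translate` are hypothetical names that only imitate the shape of `requeue.NeededAfter` and `requeue.Check` as they appear in the diff.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errBoom is an ordinary failure used for the demonstration.
var errBoom = errors.New("boom")

// requeueAfterError is a hypothetical error type carrying a requeue delay.
type requeueAfterError struct {
	cause error
	after time.Duration
}

func (e *requeueAfterError) Error() string {
	if e.cause == nil {
		return fmt.Sprintf("requeue after %v", e.after)
	}
	return fmt.Sprintf("requeue after %v: %v", e.after, e.cause)
}

// Unwrap exposes the underlying cause to errors.Is and errors.As.
func (e *requeueAfterError) Unwrap() error { return e.cause }

// neededAfter wraps an optional cause into a requeue request.
func neededAfter(cause error, after time.Duration) error {
	return &requeueAfterError{cause: cause, after: after}
}

// check reports whether err, or anything it wraps, asks for a requeue.
func check(err error) (bool, time.Duration) {
	var r *requeueAfterError
	if errors.As(err, &r) {
		return true, r.after
	}
	return false, 0
}

// translate is the single place where internal errors become what a
// controller-runtime style Reconcile returns: a delay and an error.
func translate(err error) (time.Duration, error) {
	if ok, after := check(err); ok {
		return after, nil
	}
	return 0, err
}

func main() {
	pending := neededAfter(nil, 500*time.Millisecond)
	wrapped := fmt.Errorf("cleanup: %w", pending)

	for _, err := range []error{nil, pending, wrapped, errBoom} {
		after, out := translate(err)
		fmt.Printf("in=%v -> requeueAfter=%v err=%v\n", err, after, out)
	}
}
```

With a single translation point, inner functions can return requeue requests freely, and a wrapped request is still recognized because `errors.As` walks the error chain.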

tjamet · 2025-10-31T13:38:57Z

pkg/controller/resourcegraphdefinition/controller_cleanup.go

+	if !completed {
+		log.V(1).Info("CRD deletion in progress, requeuing", "crd", crdName)
+		return requeue.NeededAfter(nil, crdDeletionRequeueDuration)
+	}


I sense this has some overlap with KREP-004 (#763).
With KREP-004 we wouldn't need to manage requeues on our own. RGD reconciliation would be re-triggered once the CRD got deleted.

tjamet · 2025-10-31T13:51:40Z

pkg/controller/resourcegraphdefinition/controller_cleanup.go

+	crd, err := r.clientSet.CRD().Get(ctx, name, metav1.GetOptions{})
+	if err != nil {
+		if apierrors.IsNotFound(err) {
+			// CRD is gone, deletion complete
+			return true, nil
+		}
+		return false, err
+	}
+
+	// If CRD has no deletion timestamp, initiate deletion
+	if crd.DeletionTimestamp.IsZero() {
+		log.V(1).Info("initiating CRD deletion", "crd", name)
+		err = r.clientSet.CRD().Delete(ctx, name, metav1.DeleteOptions{})
+		if err != nil && !apierrors.IsNotFound(err) {
+			return false, err
+		}
+		// Deletion initiated, but not complete yet
+		return false, nil
+	}


Why is the get and DeletionTimestamp check required?

IIRC err = r.clientSet.CRD().Delete(ctx, name, metav1.DeleteOptions{}) already triggers the deletion of the CRD asynchronously. Should the CRD already have a DeletionTimestamp, the API should also return immediately. Am I missing anything?

If so, we could potentially avoid some API calls and simplify the code to just this (we may lose some granularity on the log side, though).

Suggested change

-	crd, err := r.clientSet.CRD().Get(ctx, name, metav1.GetOptions{})
-	if err != nil {
-		if apierrors.IsNotFound(err) {
-			// CRD is gone, deletion complete
-			return true, nil
-		}
-		return false, err
-	}
-
-	// If CRD has no deletion timestamp, initiate deletion
-	if crd.DeletionTimestamp.IsZero() {
-		log.V(1).Info("initiating CRD deletion", "crd", name)
-		err = r.clientSet.CRD().Delete(ctx, name, metav1.DeleteOptions{})
-		if err != nil && !apierrors.IsNotFound(err) {
-			return false, err
-		}
-		// Deletion initiated, but not complete yet
-		return false, nil
-	}
+	log.V(1).Info("ensure CRD is marked for deletion", "crd", name)
+	err = r.clientSet.CRD().Delete(ctx, name, metav1.DeleteOptions{})
+	if err != nil && !apierrors.IsNotFound(err) {
+		return false, err
+	}
+	if apierrors.IsNotFound(err) {
+		log.V(1).Info("CRD deletion completed", "crd", name)
+	}
+	// Deletion initiated, but not complete yet
+	return false, nil
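
As a rough illustration of the Delete-only approach, the sketch below uses an in-memory fake instead of the real client. `errNotFound`, `crdDeleter`, `deleteCRD`, and `fakeCRDs` are hypothetical stand-ins (for `apierrors.IsNotFound` and `r.clientSet.CRD()` respectively). Unlike the suggestion as written, this variant returns `true` on NotFound so that the caller can stop requeuing; that is an assumption about the intended behavior, not something the thread confirms.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for an error that
// apierrors.IsNotFound would match.
var errNotFound = errors.New("not found")

// crdDeleter is a hypothetical, minimal slice of the CRD client.
type crdDeleter interface {
	Delete(name string) error
}

// deleteCRD issues Delete unconditionally, without a prior Get, and
// reports whether the CRD is already gone.
func deleteCRD(c crdDeleter, name string) (completed bool, err error) {
	err = c.Delete(name)
	switch {
	case errors.Is(err, errNotFound):
		// Nothing left to delete: deletion is complete.
		return true, nil
	case err != nil:
		return false, err
	default:
		// Deletion requested or still in progress; check again later.
		return false, nil
	}
}

// fakeCRDs accepts `remaining` Delete calls before the CRD disappears,
// imitating finalizers that take a while to run.
type fakeCRDs struct{ remaining int }

func (f *fakeCRDs) Delete(string) error {
	if f.remaining == 0 {
		return errNotFound
	}
	f.remaining--
	return nil
}

func main() {
	c := &fakeCRDs{remaining: 2}
	for i := 1; i <= 3; i++ {
		done, err := deleteCRD(c, "widgets.example.com")
		fmt.Printf("attempt %d: completed=%v err=%v\n", i, done, err)
	}
}
```

The trade-off mentioned in the comment still applies: without the Get, the code can no longer log separately whether it initiated the deletion or merely observed one in progress.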

k8s-ci-robot added the cncf-cla: yes label Sep 21, 2025

k8s-ci-robot requested review from JoelSpeed and bridgetkromhout September 21, 2025 08:22

k8s-ci-robot added the approved and size/L labels Sep 21, 2025

a-hilaly added the do-not-merge/hold label Sep 21, 2025

a-hilaly force-pushed the perf-crds branch from e4e0d46 to 5cf0211 September 21, 2025 20:47

a-hilaly removed the do-not-merge/hold label Sep 21, 2025

k8s-ci-robot added the needs-rebase label Sep 26, 2025

a-hilaly commented Sep 28, 2025

a-hilaly force-pushed the perf-crds branch from 5cf0211 to a6e1350 October 10, 2025 14:36

k8s-ci-robot removed the needs-rebase label Oct 10, 2025

a-hilaly force-pushed the perf-crds branch 2 times, most recently from cdb5e66 to ddba517 October 12, 2025 02:03

k8s-ci-robot added the needs-rebase label Oct 13, 2025

a-hilaly force-pushed the perf-crds branch from ddba517 to 2250806 October 28, 2025 07:55

k8s-ci-robot added the size/XL label and removed the needs-rebase and size/L labels Oct 28, 2025

a-hilaly commented Oct 28, 2025

tjamet reviewed Oct 31, 2025
