
Commit cea31d3

fix(resources): prevent indefinite blocking on cloud resource cleanup during deletion
When ensureCloudResourcesDestroyed() attempts to clean up guest cluster resources, it queries the guest cluster's KubeAPIServer. If the KubeAPIServer is already deleted during cluster deletion, these operations fail with connection errors, causing the CloudResourcesDestroyed condition to never become True, which blocks cluster deletion indefinitely.

This fix implements two safety mechanisms to handle KubeAPIServer unavailability:

1. Early KAS check: verify the kube-apiserver deployment exists in the control plane namespace before attempting cleanup. If it is not found, skip cleanup immediately, as the guest cluster is already gone.
2. Connection error tracking: track consecutive connection failures in memory and skip cleanup after 5 attempts or 5 minutes, whichever comes first. This prevents infinite retry loops when the KubeAPIServer is unreachable.

Key implementation details:

- Added isKubeAPIServerAvailable() to check KAS deployment existence using the control plane client
- Added isConnectionError() using proper K8s API errors (IsTimeout, IsServerTimeout, IsServiceUnavailable) and Go's net.Error interface instead of string matching
- Implemented in-memory failure tracking with cleanupFailureTracker to avoid persisting state and potential API errors
- The failure tracker is NOT reset when skipping due to max failures/timeout, to prevent condition flip-flopping on subsequent reconciliations
- Added comprehensive unit tests covering KAS unavailability, connection error detection, and failure tracking

The implementation ensures a stable CloudResourcesDestroyed condition status, allowing cluster deletion to proceed even when the guest cluster API is unavailable.

Signed-off-by: Mulham Raee <[email protected]>
Assisted-by: Claude 4.5 Sonnet (via Cursor)
1 parent a78c79e commit cea31d3
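
Note: the commit message names isKubeAPIServerAvailable() and isConnectionError(), but the excerpt below only shows the controller-side skip check, so here is a minimal sketch of what those two helpers could look like. The function signatures, the parameters, the package name, and the "kube-apiserver" deployment name are assumptions, not taken from the actual change.

package hostedcontrolplane // assumed package placement

import (
	"context"
	"errors"
	"net"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// isKubeAPIServerAvailable checks whether the guest cluster's kube-apiserver
// deployment still exists in the control plane namespace. If it is gone, the
// guest cluster API can no longer be reached and cleanup can be skipped.
// The deployment name "kube-apiserver" is an assumption for this sketch.
func isKubeAPIServerAvailable(ctx context.Context, cpClient client.Client, namespace string) (bool, error) {
	deployment := &appsv1.Deployment{}
	key := client.ObjectKey{Namespace: namespace, Name: "kube-apiserver"}
	if err := cpClient.Get(ctx, key, deployment); err != nil {
		if apierrors.IsNotFound(err) {
			return false, nil
		}
		return false, err
	}
	return true, nil
}

// isConnectionError classifies an error as a connectivity failure using the
// Kubernetes API error helpers and Go's net.Error interface rather than
// matching on error message strings.
func isConnectionError(err error) bool {
	if err == nil {
		return false
	}
	if apierrors.IsTimeout(err) || apierrors.IsServerTimeout(err) || apierrors.IsServiceUnavailable(err) {
		return true
	}
	var netErr net.Error
	return errors.As(err, &netErr)
}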

File tree

7 files changed: +790 −32 lines


api/hypershift/v1beta1/hostedcluster_conditions.go

Lines changed: 2 additions & 0 deletions
@@ -249,6 +249,8 @@ const (
 	KubeVirtNodesLiveMigratableReason = "KubeVirtNodesNotLiveMigratable"
 
 	RecoveryFinishedReason = "RecoveryFinished"
+
+	CloudResourcesCleanupSkippedReason = "CloudResourcesCleanupSkipped"
 )
 
 // Messages.

control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go

Lines changed: 7 additions & 0 deletions
@@ -2786,6 +2786,13 @@ func (r *HostedControlPlaneReconciler) removeCloudResources(ctx context.Context,
 		return true, nil
 	}
 
+	// check if cleanup has been skipped
+	if resourcesDestroyedCond != nil && resourcesDestroyedCond.Status == metav1.ConditionFalse &&
+		resourcesDestroyedCond.Reason == string(hyperv1.CloudResourcesCleanupSkippedReason) {
+		log.Info("Cleanup has been skipped", "reason", resourcesDestroyedCond.Message)
+		return true, nil
+	}
+
 	// if CVO has been scaled down, we're waiting for resources to be destroyed
 	cvoScaledDownCond := meta.FindStatusCondition(hcp.Status.Conditions, string(hyperv1.CVOScaledDown))
 	if cvoScaledDownCond != nil && cvoScaledDownCond.Status == metav1.ConditionTrue {
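
The skip check above reacts to a CloudResourcesCleanupSkipped condition that the cleanup path sets via the cleanupFailureTracker described in the commit message. Since the tracker itself is not part of this excerpt (only 2 of the 7 changed files appear), here is a hedged sketch of how such in-memory tracking could work; the type layout and method names are assumptions, while the 5-attempt / 5-minute thresholds and the no-reset behavior come from the commit message.

package hostedcontrolplane // assumed package placement

import "time"

// cleanupFailureTracker keeps in-memory state about consecutive connection
// failures seen while trying to clean up guest cluster cloud resources.
type cleanupFailureTracker struct {
	consecutiveFailures int
	firstFailureTime    time.Time
}

const (
	maxCleanupConnectionFailures = 5               // skip after this many consecutive failures
	maxCleanupFailureWindow      = 5 * time.Minute // or after this much elapsed time
)

// recordConnectionFailure counts one more consecutive connection failure and
// starts the clock on the first one.
func (t *cleanupFailureTracker) recordConnectionFailure(now time.Time) {
	if t.consecutiveFailures == 0 {
		t.firstFailureTime = now
	}
	t.consecutiveFailures++
}

// shouldSkipCleanup reports whether either threshold has been crossed. The
// tracker is deliberately not reset once cleanup is skipped, so the
// CloudResourcesDestroyed condition does not flip-flop on later reconciliations.
func (t *cleanupFailureTracker) shouldSkipCleanup(now time.Time) bool {
	if t.consecutiveFailures == 0 {
		return false
	}
	return t.consecutiveFailures >= maxCleanupConnectionFailures ||
		now.Sub(t.firstFailureTime) >= maxCleanupFailureWindow
}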
