diff --git a/.wordlist.txt b/.wordlist.txt
index a45df5f8..73309c17 100644
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -453,6 +453,7 @@ NVPTX
 Nano
 Navi
 Noncoherently
+NoSchedule
 NousResearch's
 NumPy
 OAM
diff --git a/docs/autoremediation/auto-remediation.md b/docs/autoremediation/auto-remediation.md
index b0ca6657..5b396f9a 100644
--- a/docs/autoremediation/auto-remediation.md
+++ b/docs/autoremediation/auto-remediation.md
@@ -88,31 +88,61 @@ The GPU Operator installs Argo Workflows v4.0.3, using a [customized installatio
 
 For OpenShift users: To use the auto remediation feature, additional steps are required to install Argo Workflows on the OpenShift cluster:
 
- 1. **If using OpenShift AI Operator with CRD `DataScienceCluster`:** Argo Workflows are possibly already deployed by the OpenShift AI Operator, if the Custom Resource Definition (CRD) like workflows.argoproj.io is already existing, no additional installation is needed.
+1. **If using OpenShift AI Operator with CRD `DataScienceCluster`:** Argo Workflows may already have been deployed by the OpenShift AI Operator. If the CustomResourceDefinition `workflows.argoproj.io` already exists, no additional installation is needed.
 
- 2. **If not using OpenShift AI Operator:** Follow these steps to install Argo Workflows on your OpenShift cluster:
+2. **If not using OpenShift AI Operator:** Follow these steps to install Argo Workflows on your OpenShift cluster:
 
 a. Install CRDs (must be executed separately due to CRD size):
 
-   ```bash
-   oc apply --server-side --force-conflicts -k "https://github.com/argoproj/argo-workflows/manifests/base/crds/full?ref=v4.0.3"
-   ```
+```bash
+oc apply --server-side --force-conflicts -k "https://github.com/argoproj/argo-workflows/manifests/base/crds/full?ref=v4.0.3"
+```
 
 b. Add the Argo Helm repository:
 
-   ```bash
-   helm repo add argo https://argoproj.github.io/argo-helm --force-update
-   ```
+```bash
+helm repo add argo https://argoproj.github.io/argo-helm --force-update
+```
 
 c. Install Argo Workflows using Helm:
 
-   ```bash
-   helm install argo-workflow argo/argo-workflows \
+```bash
+helm install argo-workflow argo/argo-workflows \
   -n argo-workflow \
   --create-namespace \
   --version=1.0.6 \
-  --set crds.install=false
-   ```
+  --set crds.install=false \
+  --set controller.instanceID.enabled=true \
+  --set controller.instanceID.explicitID=amd-gpu-operator-remediation-workflow \
+  --set 'controller.tolerations[0].key=amd-gpu-unhealthy' \
+  --set 'controller.tolerations[0].operator=Exists' \
+  --set 'controller.tolerations[0].effect=NoSchedule'
+```
+
+> **Important:** The `controller.instanceID.explicitID` value must be set to `amd-gpu-operator-remediation-workflow`. The GPU Operator labels every remediation workflow and workflow template it creates with `workflows.argoproj.io/controller-instanceid: amd-gpu-operator-remediation-workflow`. An Argo workflow-controller only reconciles workflows whose `controller-instanceid` label matches its configured `instanceID`, so without this setting the Helm-installed controller will silently ignore the operator's workflows. Refer to the [Argo Workflows controller `instanceID` documentation](https://argo-workflows.readthedocs.io/en/stable/scaling/#instanceid) and the [`argo-workflows` chart values](https://github.com/argoproj/argo-helm/blob/main/charts/argo-workflows/values.yaml) for full details.
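+>
+> One way to confirm the match after installation is to list remediation workflows by that label. This is a sketch; it only returns results once the operator has actually created remediation workflows:
+>
+> ```bash
+> oc get workflows -A -l workflows.argoproj.io/controller-instanceid=amd-gpu-operator-remediation-workflow
+> ```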
+>
+> **Important:** The `controller.tolerations` entry for `amd-gpu-unhealthy:NoSchedule` is required. During remediation the GPU Operator taints the affected node with `amd-gpu-unhealthy:NoSchedule` (see [NodeRemediationTaints](#node-drain-policy-configuration)). If the workflow-controller pod happens to be scheduled on a node that later gets tainted, it will be evicted and remediation will stall. Adding this toleration ensures the controller keeps running on tainted nodes so it can continue driving the workflow to completion. The same toleration is applied to the in-tree workflow controller, metrics-exporter, and other operator-managed components.
+>
+> The **same toleration must also be added to the Kernel Module Management (KMM) operator's Helm chart** when KMM is installed separately on OpenShift. KMM is responsible for (re)building and loading the GPU driver kernel module on the node after a remediation reboot. If its controller pod cannot tolerate `amd-gpu-unhealthy:NoSchedule`, it may be evicted from a tainted node and the driver will never be reloaded, blocking the post-reboot validation step of the workflow. When installing the KMM operator via Helm, pass the equivalent flags (the exact key path depends on the KMM chart you use, e.g. `controller.manager.tolerations` for the upstream chart):
+>
+> ```bash
+> --set 'controller.manager.tolerations[0].key=amd-gpu-unhealthy' \
+> --set 'controller.manager.tolerations[0].operator=Exists' \
+> --set 'controller.manager.tolerations[0].effect=NoSchedule'
+> ```
+
+If you prefer a values file over `--set`, the equivalent block is:
+
+```yaml
+controller:
+  instanceID:
+    enabled: true
+    explicitID: amd-gpu-operator-remediation-workflow
+  tolerations:
+  - key: amd-gpu-unhealthy
+    operator: Exists
+    effect: NoSchedule
+```
 
 ## Configuration and customization
 
@@ -292,7 +322,7 @@ The following example demonstrates a complete error mapping configuration:
 
 **notifyTestFailureMessage** - Contains instructions to be displayed when validation tests fail after remediation attempts. This message typically includes escalation procedures and diagnostic information requirements.
 
-**recoveryPolicy** - Defines limits on remediation attempts to prevent excessive recovery cycles. Includes `maxAllowedRunsPerWindow` (maximum retry attempts) and `windowSize` (time window for counting attempts). When exceeded, the workflow pauses for manual intervention.
+**recoveryPolicy** - Defines limits on remediation attempts to prevent excessive recovery cycles. Includes `maxAllowedRunsPerWindow` (maximum retry attempts) and `windowSize` (time window for counting attempts). Once the number of remediation workflows exceeds `maxAllowedRunsPerWindow`, no new workflow is triggered for the same node condition within `windowSize`. After the window elapses, a new remediation workflow is allowed to start if the issue still persists (see the sketch below).
 
 **skipRebootStep** - Controls whether the node reboot step is executed during the remediation workflow. The default workflow template includes an automatic reboot step to reinitialize GPU hardware after performing the recommended remediation actions. Set this field to `true` to skip the reboot step when the node has already been rebooted manually as part of the remediation process or when a reboot is not desired for the specific error condition. Default value is `false`.
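+
+For reference, a minimal sketch of how these fields fit together in an error-mapping entry. Only the field names (`recoveryPolicy`, `maxAllowedRunsPerWindow`, `windowSize`, `skipRebootStep`) are taken from the descriptions above; the surrounding structure and the value formats are illustrative assumptions, so consult the complete error mapping example for the exact schema:
+
+```yaml
+recoveryPolicy:
+  maxAllowedRunsPerWindow: 3   # after 3 runs in one window, pause for manual intervention
+  windowSize: 24h              # time window used when counting runs (format is illustrative)
+skipRebootStep: false          # keep the default reboot step; set true if the node was rebooted manually
+```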
diff --git a/internal/controllers/mock_remediation_handler.go b/internal/controllers/mock_remediation_handler.go
index 8b4c4317..d4c50c18 100644
--- a/internal/controllers/mock_remediation_handler.go
+++ b/internal/controllers/mock_remediation_handler.go
@@ -636,6 +636,21 @@ func (mr *MockremediationMgrHelperAPIMockRecorder) isRemediationDisabled(ctx, de
 	return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "isRemediationDisabled", reflect.TypeOf((*MockremediationMgrHelperAPI)(nil).isRemediationDisabled), ctx, devConfig)
 }
 
+// isWorkflowCRDPresent mocks base method.
+func (m *MockremediationMgrHelperAPI) isWorkflowCRDPresent(ctx context.Context) (bool, error) {
+	m.ctrl.T.Helper()
+	ret := m.ctrl.Call(m, "isWorkflowCRDPresent", ctx)
+	ret0, _ := ret[0].(bool)
+	ret1, _ := ret[1].(error)
+	return ret0, ret1
+}
+
+// isWorkflowCRDPresent indicates an expected call of isWorkflowCRDPresent.
+func (mr *MockremediationMgrHelperAPIMockRecorder) isWorkflowCRDPresent(ctx any) *gomock.Call {
+	mr.mock.ctrl.T.Helper()
+	return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "isWorkflowCRDPresent", reflect.TypeOf((*MockremediationMgrHelperAPI)(nil).isWorkflowCRDPresent), ctx)
+}
+
 // isWorkflowSchedulableOnNode mocks base method.
 func (m *MockremediationMgrHelperAPI) isWorkflowSchedulableOnNode(ctx context.Context, devConfig *v1alpha1.DeviceConfig, node *v1.Node, mapping ConditionWorkflowMapping) bool {
 	m.ctrl.T.Helper()
diff --git a/internal/controllers/remediation_handler.go b/internal/controllers/remediation_handler.go
index 55b7d8ba..6d5f5782 100644
--- a/internal/controllers/remediation_handler.go
+++ b/internal/controllers/remediation_handler.go
@@ -51,6 +51,7 @@ import (
 	workflowv1alpha1 "github.com/argoproj/argo-workflows/v4/pkg/apis/workflow/v1alpha1"
 	appsv1 "k8s.io/api/apps/v1"
 	v1 "k8s.io/api/core/v1"
+	"k8s.io/apimachinery/pkg/api/meta"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/apimachinery/pkg/labels"
 	"k8s.io/client-go/kubernetes"
@@ -253,10 +254,27 @@ func (n *remediationMgr) HandleRemediation(ctx context.Context, devConfig *amdv1
 // HandleDelete handles the delete operations during remediation process
 func (n *remediationMgr) HandleDelete(ctx context.Context, deviceConfig *amdv1alpha1.DeviceConfig, nodeList *v1.NodeList) (res ctrl.Result, err error) {
 	res = ctrl.Result{Requeue: true, RequeueAfter: time.Second * 20}
-	remediationDisabled, err := n.helper.isRemediationDisabled(ctx, deviceConfig)
-	if err != nil || remediationDisabled {
+
+	if deviceConfig.Spec.RemediationWorkflow.Config == nil || deviceConfig.Spec.RemediationWorkflow.Config.Name == "" {
+		cfgMapName := deviceConfig.Name + "-" + DefaultConfigMapSuffix
+		if _, getErr := n.helper.getConfigMap(ctx, cfgMapName, deviceConfig.Namespace); getErr == nil {
+			if err := n.helper.deleteConfigMap(ctx, cfgMapName, deviceConfig.Namespace); err == nil {
+				log.FromContext(ctx).Info(fmt.Sprintf("Deleted ConfigMap: %s", cfgMapName))
+			} else {
+				log.FromContext(ctx).Error(err, fmt.Sprintf("Failed to delete ConfigMap: %s", cfgMapName))
+			}
+		}
+	}
+
+	crdPresent, err := n.helper.isWorkflowCRDPresent(ctx)
+	if err != nil {
+		log.FromContext(ctx).Error(err, "Failed to check if Argo Workflow CRD is present")
 		return res, err
 	}
+	if !crdPresent {
+		log.FromContext(ctx).Info("Argo Workflow CRD not found in the cluster, skipping workflow cleanup")
+		return res, nil
+	}
 
 	wfList, err := n.helper.getWorkflowList(ctx, deviceConfig.Namespace)
 	if err != nil {
@@ -271,13 +289,6 @@ func (n *remediationMgr) HandleDelete(ctx context.Context, deviceConfig *amdv1al
 		log.FromContext(ctx).Info(fmt.Sprintf("Deleted workflow: %s", wf.Name))
 	}
 
-	if deviceConfig.Spec.RemediationWorkflow.Config == nil || deviceConfig.Spec.RemediationWorkflow.Config.Name == "" {
-		cfgMapName := deviceConfig.Name + "-" + DefaultConfigMapSuffix
-		if err := n.helper.deleteConfigMap(ctx, cfgMapName, deviceConfig.Namespace); err == nil {
-			log.FromContext(ctx).Info(fmt.Sprintf("Deleted ConfigMap: %s", cfgMapName))
-		}
-	}
-
 	return
 }
 
@@ -287,6 +298,7 @@ func (n *remediationMgr) HandleDelete(ctx context.Context, deviceConfig *amdv1al
 type remediationMgrHelperAPI interface {
 	getServiceAccountName(ctx context.Context, devConfig *amdv1alpha1.DeviceConfig) string
 	isRemediationDisabled(ctx context.Context, devConfig *amdv1alpha1.DeviceConfig) (bool, error)
+	isWorkflowCRDPresent(ctx context.Context) (bool, error)
 	resumeSuspendedWorkflow(ctx context.Context, wfName, namespace string) error
 	isDriverUpgradeInProgress(devCfg *amdv1alpha1.DeviceConfig, node *v1.Node) bool
 	checkIfTaintExists(node *v1.Node, devConfig *amdv1alpha1.DeviceConfig, nodeCondition string) bool
@@ -410,6 +422,19 @@ func (h *remediationMgrHelper) isRemediationDisabled(ctx context.Context, devCon
 	return false, nil
 }
 
+// isWorkflowCRDPresent reports whether the Argo Workflow CRD is registered in the cluster.
+// It uses the RESTMapper to detect the resource without requiring a cluster-scoped CRD GET.
+func (h *remediationMgrHelper) isWorkflowCRDPresent(ctx context.Context) (bool, error) {
+	gvk := workflowv1alpha1.WorkflowSchemaGroupVersionKind
+	if _, err := h.client.RESTMapper().RESTMapping(gvk.GroupKind(), gvk.Version); err != nil {
+		if meta.IsNoMatchError(err) {
+			return false, nil
+		}
+		return false, err
+	}
+	return true, nil
+}
+
 func (h *remediationMgrHelper) validateUserConfigMap(ctx context.Context, devConfig *amdv1alpha1.DeviceConfig) error {
 	if devConfig.Spec.RemediationWorkflow.Config == nil || devConfig.Spec.RemediationWorkflow.Config.Name == "" {
 		return nil
@@ -1936,3 +1961,113 @@ func (h *remediationMgrHelper) handleDeviceConfigChanges(ctx context.Context, de
 	}
 }
 }
+
+func (h *remediationMgrHelper) createConfigMapFromImage(ctx context.Context, devConfig *amdv1alpha1.DeviceConfig) (ctrl.Result, error) {
+	logger := log.FromContext(ctx)
+	image := devConfig.Spec.RemediationWorkflow.ConfigMapImage
+	if image == "" {
+		return ctrl.Result{}, nil
+	}
+
+	jobName := devConfig.Name + "-" + ConfigMapImageJobSuffix
+	namespace := devConfig.Namespace
+	configMapName := devConfig.Name + "-" + DefaultConfigMapSuffix
+
+	// Check if the ConfigMap already exists.
+	existingCM, cmErr := h.getConfigMap(ctx, configMapName, namespace)
+	if cmErr == nil {
+		// ConfigMap exists - check if it was created by the same image.
+		if existingCM.Annotations != nil && existingCM.Annotations[ConfigMapImageAnnotationKey] == image {
+			return ctrl.Result{}, nil
+		}
+		// Image changed - delete the stale ConfigMap so the new image can recreate it.
+		if err := h.deleteConfigMap(ctx, configMapName, namespace); err != nil {
+			return ctrl.Result{}, fmt.Errorf("failed to delete stale configMap for image change: %w", err)
+		}
+		logger.Info("Deleted stale ConfigMap due to image change", "configMap", configMapName, "newImage", image)
+	}
+
+	// Check if a Job is already in-flight.
+	existingJob := &batchv1.Job{}
+	if err := h.client.Get(ctx, client.ObjectKey{Name: jobName, Namespace: namespace}, existingJob); err == nil {
+		for _, cond := range existingJob.Status.Conditions {
+			if cond.Type == batchv1.JobComplete && cond.Status == v1.ConditionTrue {
+				logger.Info("ConfigMap image Job completed successfully", "job", jobName)
+				return ctrl.Result{}, nil
+			}
+			if cond.Type == batchv1.JobFailed && cond.Status == v1.ConditionTrue {
+				return ctrl.Result{}, fmt.Errorf("configMap image Job %s failed: %s", jobName, cond.Message)
+			}
+		}
+		// Job still running - check if it's for the current image.
+		if len(existingJob.Spec.Template.Spec.Containers) > 0 &&
+			existingJob.Spec.Template.Spec.Containers[0].Image == image {
+			logger.Info("ConfigMap image Job is still running, requeueing", "job", jobName)
+			return ctrl.Result{Requeue: true, RequeueAfter: time.Second * 10}, nil
+		}
+		// Running Job is for a different image - delete it and create a new one.
+		propagation := metav1.DeletePropagationBackground
+		if err := h.client.Delete(ctx, existingJob, &client.DeleteOptions{PropagationPolicy: &propagation}); err != nil {
+			return ctrl.Result{}, fmt.Errorf("failed to delete old configMap image Job for image change: %w", err)
+		}
+		logger.Info("Deleted old ConfigMap image Job due to image change", "job", jobName, "newImage", image)
+		return ctrl.Result{Requeue: true, RequeueAfter: time.Second * 5}, nil
+	}
+
+	imagePullSecrets := []v1.LocalObjectReference{}
+	if len(devConfig.Spec.CommonConfig.ImageRegistrySecrets) > 0 {
+		imagePullSecrets = append(imagePullSecrets, devConfig.Spec.CommonConfig.ImageRegistrySecrets...)
+	}
+
+	// Create a new Job.
+	serviceAccount := h.getServiceAccountName(ctx, devConfig)
+	backoffLimit := int32(4)
+	ttlSeconds := int32(5)
+	job := &batchv1.Job{
+		ObjectMeta: metav1.ObjectMeta{
+			Name:      jobName,
+			Namespace: namespace,
+			OwnerReferences: []metav1.OwnerReference{
+				{
+					APIVersion: devConfig.APIVersion,
+					Kind:       devConfig.Kind,
+					Name:       devConfig.Name,
+					UID:        devConfig.UID,
+					Controller: ptr.To(true),
+				},
+			},
+		},
+		Spec: batchv1.JobSpec{
+			BackoffLimit:            &backoffLimit,
+			TTLSecondsAfterFinished: &ttlSeconds,
+			Template: v1.PodTemplateSpec{
+				Spec: v1.PodSpec{
+					ServiceAccountName: serviceAccount,
+					RestartPolicy:      v1.RestartPolicyOnFailure,
+					Containers: []v1.Container{
+						{
+							Name:  "configmap-applier",
+							Image: image,
+							// Substitute the $CM_NAME$/$CM_NAMESPACE$ placeholders in the
+							// image's bundled /remediation/configmap.yaml, apply the result,
+							// then record the source image in the annotation checked above.
+							Command: []string{"sh", "-c",
+								fmt.Sprintf(
+									"sed -e 's/[$]CM_NAME[$]/%s/g' -e 's/[$]CM_NAMESPACE[$]/%s/g' /remediation/configmap.yaml | "+
+										"kubectl apply -f - && "+
+										"kubectl annotate configmap %s -n %s %s=%s --overwrite",
+									configMapName, namespace,
+									configMapName, namespace, ConfigMapImageAnnotationKey, image),
+							},
+						},
+					},
+					ImagePullSecrets: imagePullSecrets,
+				},
+			},
+		},
+	}
+
+	if err := h.client.Create(ctx, job); err != nil {
+		return ctrl.Result{}, fmt.Errorf("failed to create configMap image Job: %w", err)
+	}
+
+	logger.Info("Created ConfigMap image Job", "job", jobName, "image", image)
+	return ctrl.Result{Requeue: true, RequeueAfter: time.Second * 10}, nil
+}
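
A standalone sketch of the RESTMapper presence-check pattern that `isWorkflowCRDPresent` relies on, using apimachinery's in-memory `DefaultRESTMapper`. The GVK values mirror the Argo Workflow CRD; the `main` wrapper and the mapper setup are illustrative and not part of this change:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
	// Same group/version/kind that workflowv1alpha1.WorkflowSchemaGroupVersionKind resolves to.
	gvk := schema.GroupVersionKind{Group: "argoproj.io", Version: "v1alpha1", Kind: "Workflow"}

	// An empty mapper stands in for a cluster without the Workflow CRD:
	// RESTMapping fails with a no-match error, which the helper reports as (false, nil).
	empty := meta.NewDefaultRESTMapper(nil)
	if _, err := empty.RESTMapping(gvk.GroupKind(), gvk.Version); meta.IsNoMatchError(err) {
		fmt.Println("CRD absent: skip workflow cleanup")
	}

	// Registering the kind stands in for a cluster where the CRD is installed:
	// RESTMapping succeeds, which the helper reports as (true, nil).
	installed := meta.NewDefaultRESTMapper([]schema.GroupVersion{gvk.GroupVersion()})
	installed.Add(gvk, meta.RESTScopeNamespace)
	if _, err := installed.RESTMapping(gvk.GroupKind(), gvk.Version); err == nil {
		fmt.Println("CRD present: proceed with workflow cleanup")
	}
}
```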