1 change: 1 addition & 0 deletions .wordlist.txt
Original file line number Diff line number Diff line change
@@ -453,6 +453,7 @@ NVPTX
Nano
Navi
Noncoherently
NoSchedule
NousResearch's
NumPy
OAM
56 changes: 43 additions & 13 deletions docs/autoremediation/auto-remediation.md
@@ -88,31 +88,61 @@ The GPU Operator installs Argo Workflows v4.0.3, using a [customized installatio

For OpenShift users: To use the auto remediation feature, additional steps are required to install Argo Workflows on the OpenShift cluster, and the following cases require special consideration:

1. **If using OpenShift AI Operator with CRD `DataScienceCluster`:** Argo Workflows are possibly already deployed by the OpenShift AI Operator, if the Custom Resource Definition (CRD) like workflows.argoproj.io is already existing, no additional installation is needed.
1. **If using OpenShift AI Operator with CRD `DataScienceCluster`:** Argo Workflows may already have been deployed by the OpenShift AI Operator. If a CustomResourceDefinition such as `workflows.argoproj.io` already exists, no additional installation is needed.

2. **If not using OpenShift AI Operator:** Follow these steps to install Argo Workflows on your OpenShift cluster:
2. **If not using OpenShift AI Operator:** Follow these steps to install Argo Workflows on your OpenShift cluster:

a. Install CRDs (must be executed separately due to CRD size):

```bash
oc apply --server-side --force-conflicts -k "https://github.com/argoproj/argo-workflows/manifests/base/crds/full?ref=v4.0.3"
```
```bash
oc apply --server-side --force-conflicts -k "https://github.com/argoproj/argo-workflows/manifests/base/crds/full?ref=v4.0.3"
```

b. Add the Argo Helm repository:

```bash
helm repo add argo https://argoproj.github.io/argo-helm --force-update
```
```bash
helm repo add argo https://argoproj.github.io/argo-helm --force-update
```

c. Install Argo Workflows using Helm:

```bash
helm install argo-workflow argo/argo-workflows \
```bash
helm install argo-workflow argo/argo-workflows \
-n argo-workflow \
--create-namespace \
--version=1.0.6 \
--set crds.install=false
```
--set crds.install=false \
--set controller.instanceID.enabled=true \
--set controller.instanceID.explicitID=amd-gpu-operator-remediation-workflow \
--set 'controller.tolerations[0].key=amd-gpu-unhealthy' \
--set 'controller.tolerations[0].operator=Exists' \
--set 'controller.tolerations[0].effect=NoSchedule'
```

> **Important:** The `controller.instanceID.explicitID` value must be set to `amd-gpu-operator-remediation-workflow`. The GPU Operator labels every remediation workflow and workflow template it creates with `workflows.argoproj.io/controller-instanceid: amd-gpu-operator-remediation-workflow`. An Argo workflow-controller only reconciles workflows whose `controller-instanceid` label matches its configured `instanceID`, so without this setting the Helm-installed controller will silently ignore the operator's workflows. Refer to the [Argo Workflows controller `instanceID` documentation](https://argo-workflows.readthedocs.io/en/stable/scaling/#instanceid) and the [`argo-workflows` chart values](https://github.com/argoproj/argo-helm/blob/main/charts/argo-workflows/values.yaml) for full details.
>
> **Important:** The `controller.tolerations` entry for `amd-gpu-unhealthy:NoSchedule` is required. During remediation the GPU Operator taints the affected node with `amd-gpu-unhealthy:NoSchedule` (see [NodeRemediationTaints](#node-drain-policy-configuration)). If the workflow-controller pod happens to be scheduled on a node that later gets tainted, it will be evicted and remediation will stall. Adding this toleration ensures the controller keeps running on tainted nodes so it can continue driving the workflow to completion. The same toleration is applied to the in-tree workflow controller, metrics-exporter, and other operator-managed components.
>
> The **same toleration must also be added to the Kernel Module Management (KMM) operator's Helm chart** when KMM is installed separately on OpenShift. KMM is responsible for (re)building and loading the GPU driver kernel module on the node after a remediation reboot — if its controller pod cannot tolerate `amd-gpu-unhealthy:NoSchedule`, it may be evicted from a tainted node and the driver will never be reloaded, blocking the post-reboot validation step of the workflow. When installing the KMM operator via Helm, pass the equivalent flags (the exact key path depends on the KMM chart you use, e.g. `controller.manager.tolerations` for the upstream chart):
>
> ```bash
> --set 'controller.manager.tolerations[0].key=amd-gpu-unhealthy' \
> --set 'controller.manager.tolerations[0].operator=Exists' \
> --set 'controller.manager.tolerations[0].effect=NoSchedule'
> ```

If you prefer a values file over `--set`, the equivalent block is:

```yaml
controller:
instanceID:
enabled: true
explicitID: amd-gpu-operator-remediation-workflow
tolerations:
- key: amd-gpu-unhealthy
operator: Exists
effect: NoSchedule
```
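To make the `instanceID` requirement concrete, the label matching that the workflow-controller performs can be sketched as follows (a minimal illustration; `reconciles` and its arguments are hypothetical names for this sketch, not Argo's actual code):

```go
package main

import "fmt"

// instanceIDLabel is the label Argo uses to bind a workflow to a controller.
const instanceIDLabel = "workflows.argoproj.io/controller-instanceid"

// reconciles reports whether a controller configured with instanceID would
// act on a workflow carrying the given labels: the workflow's label must
// match the controller's configured instance ID exactly.
func reconciles(instanceID string, workflowLabels map[string]string) bool {
	return workflowLabels[instanceIDLabel] == instanceID
}

func main() {
	// Label applied by the GPU Operator to every remediation workflow.
	labels := map[string]string{
		instanceIDLabel: "amd-gpu-operator-remediation-workflow",
	}
	// Controller configured with the matching explicitID: reconciled.
	fmt.Println(reconciles("amd-gpu-operator-remediation-workflow", labels)) // true
	// Controller configured with any other instance ID: workflow is ignored.
	fmt.Println(reconciles("some-other-controller", labels)) // false
}
```

This is why a controller installed without `controller.instanceID.explicitID` never picks up the operator's workflows: the labels are present, but nothing matches them.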

## Configuration and customization

@@ -292,7 +322,7 @@ The following example demonstrates a complete error mapping configuration:

**notifyTestFailureMessage** - Contains instructions to be displayed when validation tests fail after remediation attempts. This message typically includes escalation procedures and diagnostic information requirements.

**recoveryPolicy** - Defines limits on remediation attempts to prevent excessive recovery cycles. Includes `maxAllowedRunsPerWindow` (maximum retry attempts) and `windowSize` (time window for counting attempts). When exceeded, the workflow pauses for manual intervention.
**recoveryPolicy** - Defines limits on remediation attempts to prevent excessive recovery cycles. Includes `maxAllowedRunsPerWindow` (maximum retry attempts) and `windowSize` (time window for counting attempts). Once the number of remediation workflows reaches `maxAllowedRunsPerWindow`, no new workflow is triggered for the same node condition within `windowSize`. After the window elapses, if the issue persists, a new remediation workflow is allowed to start again.

**skipRebootStep** - Controls whether the node reboot step is executed during the remediation workflow. The default workflow template includes an automatic reboot step to reinitialize GPU hardware after performing the recommended remediation actions. Set this field to `true` to skip the reboot step when the node has already been rebooted manually as part of the remediation process or when a reboot is not desired for the specific error condition. Default value is `false`.

15 changes: 15 additions & 0 deletions internal/controllers/mock_remediation_handler.go

Some generated files are not rendered by default.

153 changes: 144 additions & 9 deletions internal/controllers/remediation_handler.go
@@ -51,6 +51,7 @@ import (
workflowv1alpha1 "github.com/argoproj/argo-workflows/v4/pkg/apis/workflow/v1alpha1"
appsv1 "k8s.io/api/apps/v1"
v1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/api/meta"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/labels"
"k8s.io/client-go/kubernetes"
@@ -253,10 +254,27 @@ func (n *remediationMgr) HandleRemediation(ctx context.Context, devConfig *amdv1
// HandleDelete handles the delete operations during remediation process
func (n *remediationMgr) HandleDelete(ctx context.Context, deviceConfig *amdv1alpha1.DeviceConfig, nodeList *v1.NodeList) (res ctrl.Result, err error) {
res = ctrl.Result{Requeue: true, RequeueAfter: time.Second * 20}
remediationDisabled, err := n.helper.isRemediationDisabled(ctx, deviceConfig)
if err != nil || remediationDisabled {

if deviceConfig.Spec.RemediationWorkflow.Config == nil || deviceConfig.Spec.RemediationWorkflow.Config.Name == "" {
cfgMapName := deviceConfig.Name + "-" + DefaultConfigMapSuffix
if _, getErr := n.helper.getConfigMap(ctx, cfgMapName, deviceConfig.Namespace); getErr == nil {
if err := n.helper.deleteConfigMap(ctx, cfgMapName, deviceConfig.Namespace); err == nil {
log.FromContext(ctx).Info(fmt.Sprintf("Deleted ConfigMap: %s", cfgMapName))
} else {
log.FromContext(ctx).Error(err, fmt.Sprintf("Failed to delete ConfigMap: %s", cfgMapName))
}
}
}

crdPresent, err := n.helper.isWorkflowCRDPresent(ctx)
if err != nil {
log.FromContext(ctx).Error(err, "Failed to check if Argo Workflow CRD is present")
return res, err
}
if !crdPresent {
log.FromContext(ctx).Info("Argo Workflow CRD not found in the cluster, skipping workflow cleanup")
return res, nil
}

wfList, err := n.helper.getWorkflowList(ctx, deviceConfig.Namespace)
if err != nil {
@@ -271,13 +289,6 @@ func (n *remediationMgr) HandleDelete(ctx context.Context, deviceConfig *amdv1al
log.FromContext(ctx).Info(fmt.Sprintf("Deleted workflow: %s", wf.Name))
}

if deviceConfig.Spec.RemediationWorkflow.Config == nil || deviceConfig.Spec.RemediationWorkflow.Config.Name == "" {
cfgMapName := deviceConfig.Name + "-" + DefaultConfigMapSuffix
if err := n.helper.deleteConfigMap(ctx, cfgMapName, deviceConfig.Namespace); err == nil {
log.FromContext(ctx).Info(fmt.Sprintf("Deleted ConfigMap: %s", cfgMapName))
}
}

return
}

@@ -287,6 +298,7 @@ func (n *remediationMgr) HandleDelete(ctx context.Context, deviceConfig *amdv1al
type remediationMgrHelperAPI interface {
getServiceAccountName(ctx context.Context, devConfig *amdv1alpha1.DeviceConfig) string
isRemediationDisabled(ctx context.Context, devConfig *amdv1alpha1.DeviceConfig) (bool, error)
isWorkflowCRDPresent(ctx context.Context) (bool, error)
resumeSuspendedWorkflow(ctx context.Context, wfName, namespace string) error
isDriverUpgradeInProgress(devCfg *amdv1alpha1.DeviceConfig, node *v1.Node) bool
checkIfTaintExists(node *v1.Node, devConfig *amdv1alpha1.DeviceConfig, nodeCondition string) bool
@@ -410,6 +422,19 @@ func (h *remediationMgrHelper) isRemediationDisabled(ctx context.Context, devCon
return false, nil
}

// isWorkflowCRDPresent reports whether the Argo Workflow CRD is registered in the cluster.
// It uses the RESTMapper to detect the resource without requiring a cluster-scoped CRD GET.
func (h *remediationMgrHelper) isWorkflowCRDPresent(ctx context.Context) (bool, error) {
gvk := workflowv1alpha1.WorkflowSchemaGroupVersionKind
if _, err := h.client.RESTMapper().RESTMapping(gvk.GroupKind(), gvk.Version); err != nil {
if meta.IsNoMatchError(err) {
return false, nil
}
return false, err
}
return true, nil
}

func (h *remediationMgrHelper) validateUserConfigMap(ctx context.Context, devConfig *amdv1alpha1.DeviceConfig) error {
if devConfig.Spec.RemediationWorkflow.Config == nil || devConfig.Spec.RemediationWorkflow.Config.Name == "" {
return nil
Expand Down Expand Up @@ -1936,3 +1961,113 @@ func (h *remediationMgrHelper) handleDeviceConfigChanges(ctx context.Context, de
}
}
}

func (h *remediationMgrHelper) createConfigMapFromImage(ctx context.Context, devConfig *amdv1alpha1.DeviceConfig) (ctrl.Result, error) {
logger := log.FromContext(ctx)
image := devConfig.Spec.RemediationWorkflow.ConfigMapImage
if image == "" {
return ctrl.Result{}, nil
}

jobName := devConfig.Name + "-" + ConfigMapImageJobSuffix
namespace := devConfig.Namespace
configMapName := devConfig.Name + "-" + DefaultConfigMapSuffix

// Check if the ConfigMap already exists.
existingCM, cmErr := h.getConfigMap(ctx, configMapName, namespace)
if cmErr == nil {
// ConfigMap exists - check if it was created by the same image.
if existingCM.Annotations != nil && existingCM.Annotations[ConfigMapImageAnnotationKey] == image {
return ctrl.Result{}, nil
}
// Image changed - delete the stale ConfigMap so the new image can recreate it.
if err := h.deleteConfigMap(ctx, configMapName, namespace); err != nil {
return ctrl.Result{}, fmt.Errorf("failed to delete stale configMap for image change: %w", err)
}
logger.Info("Deleted stale ConfigMap due to image change", "configMap", configMapName, "newImage", image)
}

// Check if a Job is already in-flight.
existingJob := &batchv1.Job{}
if err := h.client.Get(ctx, client.ObjectKey{Name: jobName, Namespace: namespace}, existingJob); err == nil {
for _, cond := range existingJob.Status.Conditions {
if cond.Type == batchv1.JobComplete && cond.Status == v1.ConditionTrue {
logger.Info("ConfigMap image Job completed successfully", "job", jobName)
return ctrl.Result{}, nil
}
if cond.Type == batchv1.JobFailed && cond.Status == v1.ConditionTrue {
return ctrl.Result{}, fmt.Errorf("configMap image Job %s failed: %s", jobName, cond.Message)
}
}
// Job still running - check if it's for the current image.
if len(existingJob.Spec.Template.Spec.Containers) > 0 &&
existingJob.Spec.Template.Spec.Containers[0].Image == image {
logger.Info("ConfigMap image Job is still running, requeueing", "job", jobName)
return ctrl.Result{Requeue: true, RequeueAfter: time.Second * 10}, nil
}
// Running Job is for a different image - delete it and create a new one.
propagation := metav1.DeletePropagationBackground
if err := h.client.Delete(ctx, existingJob, &client.DeleteOptions{PropagationPolicy: &propagation}); err != nil {
return ctrl.Result{}, fmt.Errorf("failed to delete old configMap image Job for image change: %w", err)
}
logger.Info("Deleted old ConfigMap image Job due to image change", "job", jobName, "newImage", image)
return ctrl.Result{Requeue: true, RequeueAfter: time.Second * 5}, nil
}

imagePullSecrets := []v1.LocalObjectReference{}
if len(devConfig.Spec.CommonConfig.ImageRegistrySecrets) > 0 {
imagePullSecrets = append(imagePullSecrets, devConfig.Spec.CommonConfig.ImageRegistrySecrets...)
}

// Create a new Job.
serviceAccount := h.getServiceAccountName(ctx, devConfig)
backoffLimit := int32(4)
ttlSeconds := int32(5)
job := &batchv1.Job{
ObjectMeta: metav1.ObjectMeta{
Name: jobName,
Namespace: namespace,
OwnerReferences: []metav1.OwnerReference{
{
APIVersion: devConfig.APIVersion,
Kind: devConfig.Kind,
Name: devConfig.Name,
UID: devConfig.UID,
Controller: ptr.To(true),
},
},
},
Spec: batchv1.JobSpec{
BackoffLimit: &backoffLimit,
TTLSecondsAfterFinished: &ttlSeconds,
Template: v1.PodTemplateSpec{
Spec: v1.PodSpec{
ServiceAccountName: serviceAccount,
RestartPolicy: v1.RestartPolicyOnFailure,
Containers: []v1.Container{
{
Name: "configmap-applier",
Image: image,
Command: []string{"sh", "-c",
fmt.Sprintf(
"sed -e 's/[$]CM_NAME[$]/%s/g' -e 's/[$]CM_NAMESPACE[$]/%s/g' /remediation/configmap.yaml | "+
"kubectl apply -f - && "+
"kubectl annotate configmap %s -n %s %s=%s --overwrite",
configMapName, namespace,
configMapName, namespace, ConfigMapImageAnnotationKey, image),
},
},
},
ImagePullSecrets: imagePullSecrets,
},
},
},
}

if err := h.client.Create(ctx, job); err != nil {
return ctrl.Result{}, fmt.Errorf("failed to create configMap image Job: %w", err)
}

logger.Info("Created ConfigMap image Job", "job", jobName, "image", image)
return ctrl.Result{Requeue: true, RequeueAfter: time.Second * 10}, nil
}