
[CP 1386] [CP 1363] handle reboot step failure in workflow #531

Open
ci-penbot-01 wants to merge 1 commit into ROCm:main from ci-penbot-01:CP.O2O.pensando.gpu-operator.1386.rocm.gpu-operator.main

@ci-penbot-01
Contributor

Cherry-pick of pensando/gpu-operator#1386


Source PR Description (pensando/gpu-operator#1386):

During remediation, the reboot workflow step ran as a pod on the very node it was rebooting. The pod was therefore killed mid-execution, which Argo interpreted as a step failure.

To fix this, the single reboot step has been split into two separate steps:

* Reboot: issues the reboot command and exits gracefully before the node goes down.
* WaitForNodeReady: waits for the rebooted node to come back online and report as ready.

This separation ensures the reboot pod shuts down cleanly and the workflow correctly tracks node recovery without false failures.
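
The waiting half of this split can be illustrated with a minimal sketch. It is not the code from this PR: `fetch_node` is a hypothetical callable standing in for whatever the workflow uses to read the Node object, and the timeout and interval values are placeholders. The only Kubernetes detail assumed is that a Node exposes a `Ready` condition under `status.conditions`.

```python
import time
from typing import Callable, Optional


def is_node_ready(node: dict) -> bool:
    """Return True if the Node object has a Ready condition with status "True"."""
    conditions = node.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def wait_for_node_ready(
    fetch_node: Callable[[], Optional[dict]],
    timeout_s: float = 600.0,   # placeholder value, not taken from the PR
    interval_s: float = 5.0,    # placeholder value, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll fetch_node until the node is Ready or the time budget is spent.

    fetch_node is hypothetical: it returns the current Node object as a
    dict, or None while the node cannot be queried.
    """
    waited = 0.0  # approximate; counts sleep intervals, not time spent fetching
    while waited <= timeout_s:
        node = fetch_node()
        if node is not None and is_node_ready(node):
            return True
        sleep(interval_s)
        waited += interval_s
    return False
```

Because this step runs somewhere other than the node being rebooted, it survives the reboot and can report success or a timeout to the workflow.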

Cherrypick triggered by: ACP-Automation

* handle reboot step failure in workflow

* fix node selector and affinity rules

* update documentation

* use boot id to detect successful reboot

* ignore tests folder from docs lint
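
As a rough illustration of the boot ID item above (a sketch under assumptions, not the PR's implementation): a node can still appear ready for a short time after the reboot command is issued, so readiness alone may not prove that a reboot happened. Comparing the `status.nodeInfo.bootID` value recorded before the reboot with the current one is one way to tell. The function name and the dict-shaped Node object are hypothetical.

```python
def has_rebooted(boot_id_before: str, node: dict) -> bool:
    """Return True if the node reports a boot ID different from the recorded one."""
    current = node.get("status", {}).get("nodeInfo", {}).get("bootID", "")
    # An empty boot ID means the node has not reported yet; treat that as "not rebooted".
    return bool(current) and current != boot_id_before
```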

(cherry picked from commit 1c647b6169de2f320a4b5bf164a49f3450b44bbf)

Co-authored-by: Uday Bhaskar <udayb@amd.com>
(cherry picked from commit 2225463b9b2cdd61376c06deddd0a6bdde052a7b)