[CP 1386] [CP 1363] handle reboot step failure in workflow by ci-penbot-01 · Pull Request #531 · ROCm/gpu-operator

ci-penbot-01 · 2026-04-23T05:20:25Z

Cherry-pick of pensando/gpu-operator#1386

Source PR Description (pensando/gpu-operator#1386):

When a node under remediation was rebooted, the reboot workflow step ran as a pod on the very node being rebooted. The pod was therefore killed mid-execution, which Argo interpreted as a step failure.

To fix this, the single reboot step has been split into two separate steps:

- Reboot: issues the reboot command and exits gracefully before the node goes down.
- WaitForNodeReady: waits for the rebooted node to come back online and report as ready.

This separation ensures the reboot pod shuts down cleanly and the workflow correctly tracks node recovery without false failures.
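The waiting half of this split can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names, the injected `fetch_node` callable, and the timeout values are all hypothetical. It assumes the node is represented as a dict shaped like a Kubernetes Node object, where `status.nodeInfo.bootID` changes on every boot and `status.conditions` carries the `Ready` condition. Comparing boot IDs (as the commit list for this PR mentions) guards against declaring success on a node that still reports `Ready` because it has not actually gone down yet.

```python
import time
from typing import Any, Callable, Mapping


def node_is_ready(node: Mapping[str, Any]) -> bool:
    """Return True if the node reports a Ready condition with status "True"."""
    conditions = node.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def boot_id(node: Mapping[str, Any]) -> str:
    """Return status.nodeInfo.bootID, or an empty string if it is missing."""
    return node.get("status", {}).get("nodeInfo", {}).get("bootID", "")


def wait_for_node_ready(
    fetch_node: Callable[[], Mapping[str, Any]],  # hypothetical: returns the current Node as a dict
    old_boot_id: str,                             # boot ID recorded before the reboot was issued
    timeout_s: float = 600.0,                     # illustrative defaults, not taken from the PR
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the node has a new boot ID and is Ready, or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        node = fetch_node()
        current = boot_id(node)
        # Succeed only when the node has actually rebooted (boot ID changed)
        # and has come back Ready.
        if current and current != old_boot_id and node_is_ready(node):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

In a real workflow this check would need to run somewhere other than the node being rebooted, and `fetch_node` would have to tolerate the node being temporarily unreachable; both concerns are left out of the sketch.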

Cherrypick triggered by: ACP-Automation

* handle reboot step failure in workflow
* fix node selector and affinity rules
* update documentation
* use boot id to detect successful reboot
* ignore tests folder from docs lint

(cherry picked from commit 1c647b6169de2f320a4b5bf164a49f3450b44bbf)

Co-authored-by: Uday Bhaskar <udayb@amd.com>
(cherry picked from commit 2225463b9b2cdd61376c06deddd0a6bdde052a7b)

ci-penbot-01 wants to merge 1 commit into ROCm:main from ci-penbot-01:CP.O2O.pensando.gpu-operator.1386.rocm.gpu-operator.main
