DRA and e2e fixes #536

Merged
sajmera-pensando merged 3 commits into ROCm:main from bhatnitish:rocm-cherry-picks
Apr 24, 2026
@bhatnitish
Contributor

No description provided.

* Add DeviceConfig collection in testmonitor

* Fix DRA Tests
Add ServiceAccount, ClusterRole, and ClusterRoleBinding for the DRA
driver so it can run on OpenShift clusters. The ClusterRole grants:
- privileged SCC (required for OpenShift)
- resourceslices CRUD (to publish GPU resources)
- resourceclaims get (to process allocation requests)
- nodes get (to look up node info for ResourceSlice ownership)

Also add the DRA driver service account to the OLM bundle's
extra-service-accounts list so OLM-managed installs create the SA.
# Conflicts:
#	bundle/manifests/amd-gpu-operator.clusterserviceversion.yaml
…d (#1388)
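
The permissions listed in the commit message above can be written out as rule data. The following is a minimal sketch, not the manifest from this PR: `policyRule` is a hypothetical local type that only mirrors the shape of a Kubernetes RBAC rule, and the API groups, the `use` verb on the `privileged` SCC, and the exact verb set standing in for "CRUD" are assumptions based on common Kubernetes and OpenShift conventions.

```go
package main

import "fmt"

// policyRule is a hypothetical local type mirroring the fields of a
// Kubernetes RBAC rule (apiGroups, resources, resourceNames, verbs).
type policyRule struct {
	APIGroups     []string
	Resources     []string
	ResourceNames []string
	Verbs         []string
}

// draDriverRules returns the four grants described in the commit message.
// Group names and verb lists are assumptions, not copied from the PR.
func draDriverRules() []policyRule {
	return []policyRule{
		{ // privileged SCC (required for OpenShift)
			APIGroups:     []string{"security.openshift.io"},
			Resources:     []string{"securitycontextconstraints"},
			ResourceNames: []string{"privileged"},
			Verbs:         []string{"use"},
		},
		{ // resourceslices CRUD (to publish GPU resources)
			APIGroups: []string{"resource.k8s.io"},
			Resources: []string{"resourceslices"},
			Verbs:     []string{"get", "list", "watch", "create", "update", "patch", "delete"},
		},
		{ // resourceclaims get (to process allocation requests)
			APIGroups: []string{"resource.k8s.io"},
			Resources: []string{"resourceclaims"},
			Verbs:     []string{"get"},
		},
		{ // nodes get (node lookup for ResourceSlice ownership)
			APIGroups: []string{""},
			Resources: []string{"nodes"},
			Verbs:     []string{"get"},
		},
	}
}

// contains reports whether x is present in xs.
func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// allows reports whether any rule grants verb on group/resource.
// It deliberately ignores ResourceNames to keep the sketch short.
func allows(rules []policyRule, group, resource, verb string) bool {
	for _, r := range rules {
		if contains(r.APIGroups, group) && contains(r.Resources, resource) && contains(r.Verbs, verb) {
			return true
		}
	}
	return false
}

func main() {
	rules := draDriverRules()
	fmt.Println("create resourceslices:", allows(rules, "resource.k8s.io", "resourceslices", "create"))
	fmt.Println("delete resourceclaims:", allows(rules, "resource.k8s.io", "resourceclaims", "delete"))
	fmt.Println("get nodes:", allows(rules, "", "nodes", "get"))
}
```

The small `allows` helper is only there to make the least-privilege intent checkable: the driver can write ResourceSlices but can only read ResourceClaims and Nodes.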

* Create DeviceClass from operator code on OpenShift when DRA is enabled

On OpenShift, operator-sdk cannot deploy DeviceClass resources via the
OLM bundle. This adds handleDeviceClass to the reconciler which creates
the gpu.amd.com DeviceClass using an unstructured client when running on
OpenShift with DRA driver enabled. The DeviceClass is cluster-scoped and
shared, so it is created once (AlreadyExists is handled gracefully) and
never deleted on DeviceConfig finalization.

* Use deviceClassName constant instead of hardcoded string

Address review feedback: extract "gpu.amd.com" into a const and use it
throughout handleDeviceClass.

@sajmera-pensando merged commit 0296a94 into ROCm:main on Apr 24, 2026
1 of 3 checks passed

2 participants