
Conversation

@kryanbeane
Contributor

Issue link

https://issues.redhat.com/browse/RHOAIENG-39073

What changes have been made

Adds support for configuring a priority class label on Ray Jobs, enabling Kueue preemption between jobs.

Verification steps

Prerequisites

  • RHBoK (Red Hat build of Kueue) running
  • RayJob integration enabled
  • kubectl/oc access

Setup

1. Create WorkloadPriorityClasses

kubectl apply -f - <<EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: high-priority
value: 1000
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: low-priority
value: 100
EOF
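Kueue compares WorkloadPriorityClass values numerically, with a larger value meaning higher priority, so high-priority (1000) workloads can preempt low-priority (100) ones. A quick check of that ordering, with values copied from the manifest above:

```python
# WorkloadPriorityClass values from the manifest above: the larger value wins.
classes = {"high-priority": 1000, "low-priority": 100}

# The class with the highest value is the one whose workloads can preempt others.
preemptor = max(classes, key=classes.get)
print(preemptor)  # high-priority
```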

2. Create ClusterQueue and LocalQueue

kubectl apply -f - <<EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: preemption-test-queue
spec:
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: default
          resources:
            - name: "cpu"
              nominalQuota: 3
            - name: "memory"
              nominalQuota: 18Gi
  preemption:
    reclaimWithinCohort: Never
    withinClusterQueue: Allow
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: preemption-test-local-queue
  namespace: <namespace>
spec:
  clusterQueue: preemption-test-queue
EOF
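The ClusterQueue above references a flavor named default. If the cluster does not already define a matching ResourceFlavor, Kueue will not admit any workloads from this queue. A minimal one (no node-label matching — an assumption suitable for a single-flavor test cluster) can be created first:

```shell
kubectl apply -f - <<EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default
EOF
```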

Test

from codeflare_sdk import RayJob, ManagedClusterConfig
import time

cluster_config = ManagedClusterConfig(
    head_cpu_requests='1',
    head_cpu_limits='2',
    head_memory_requests=6,
    head_memory_limits=8,
    num_workers=1,
    worker_cpu_requests='1',
    worker_cpu_limits='2',
    worker_memory_requests=5,
    worker_memory_limits=8,
)

# Submit low priority job
low_job = RayJob(
    job_name="low-priority-job",
    entrypoint="python -c 'import time; time.sleep(60)'",
    cluster_config=cluster_config,
    namespace="<namespace>",
    local_queue="preemption-test-local-queue",
    priority_class="low-priority"
)
low_job.submit()

# Wait until the low priority job is admitted and running, then verify
time.sleep(15)
low_job.status()

# Submit high priority job
high_job = RayJob(
    job_name="high-priority-job",
    entrypoint="python -c 'import time; time.sleep(60)'",
    cluster_config=cluster_config,
    namespace="<namespace>",
    local_queue="preemption-test-local-queue",
    priority_class="high-priority"
)
high_job.submit()

# Give Kueue ~30s to preempt, then verify the low priority job was evicted
time.sleep(30)
low_job.status()
high_job.status()
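The preemption is forced by quota arithmetic: with the requests in cluster_config above, each job asks for 2 CPUs (1 head + 1 worker) and 11Gi of memory (6Gi head + 5Gi worker), so two jobs would need 4 CPUs and 22Gi against a nominalQuota of 3 CPUs and 18Gi. A quick sanity check of those numbers, copied from the manifests above:

```python
# Per-job requests, taken from ManagedClusterConfig above.
head_cpu, head_mem_gi = 1, 6
worker_cpu, worker_mem_gi = 1, 5
num_workers = 1

job_cpu = head_cpu + worker_cpu * num_workers            # 2 CPUs per job
job_mem_gi = head_mem_gi + worker_mem_gi * num_workers   # 11Gi per job

# ClusterQueue nominalQuota from the manifest above.
quota_cpu, quota_mem_gi = 3, 18

one_job_fits = job_cpu <= quota_cpu and job_mem_gi <= quota_mem_gi
two_jobs_fit = 2 * job_cpu <= quota_cpu and 2 * job_mem_gi <= quota_mem_gi

# One job fits the quota; two do not, so the second (high priority)
# job can only be admitted by preempting the first.
print(one_job_fits, two_jobs_fit)  # True False
```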

Verify labels:

kubectl get rayjob low-priority-job high-priority-job -n <namespace> -o yaml | grep -A 3 labels

Expected:

  • Both jobs have kueue.x-k8s.io/queue-name: preemption-test-local-queue
  • Low priority job has kueue.x-k8s.io/priority-class: low-priority
  • High priority job has kueue.x-k8s.io/priority-class: high-priority

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Nov 25, 2025

@kryanbeane: This pull request references RHOAIENG-39073 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.21.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from szaher November 25, 2025 16:17
@codecov

codecov bot commented Nov 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.21%. Comparing base (8eac545) to head (634a03f).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #946      +/-   ##
==========================================
+ Coverage   94.13%   94.21%   +0.08%     
==========================================
  Files          24       24              
  Lines        2096     2128      +32     
==========================================
+ Hits         1973     2005      +32     
  Misses        123      123              

☔ View full report in Codecov by Sentry.

@kryanbeane kryanbeane force-pushed the RHOAIENG-38992 branch 3 times, most recently from 0c54071 to 5e50ded Compare November 26, 2025 09:34
@openshift-ci
Contributor

openshift-ci bot commented Nov 26, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pawelpaszki

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 26, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 158224f into project-codeflare:main Nov 26, 2025
20 checks passed
@kryanbeane kryanbeane deleted the RHOAIENG-38992 branch November 26, 2025 13:02