@everpeace everpeace commented Oct 15, 2025

This PR is the successor of #4032.


This PR depends on spark-operator changes:


What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR introduces kubeflow/spark-operator's SparkApplication integration🎉

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: sample-sparkapp
  labels:
    kueue.x-k8s.io/queue-name: queue

IMPORTANT NOTICE: elastic jobs are not supported yet. This means you cannot use dynamicAllocation:

#
# ❌ SparkApplication with dynamicAllocation enabled
#    will be rejected.
#
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: sample-sparkapp
  labels:
    kueue.x-k8s.io/queue-name: queue
spec:
  dynamicAllocation:
    enabled: true
    minExecutors: 1
    initialExecutors: 3
    maxExecutors: 10
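
Since such a manifest is rejected, it can be handy to catch it before submitting. The following is a minimal client-side sketch, not the validation this PR implements: `has_dynamic_allocation` is a hypothetical helper that does a rough text scan (it is not a YAML parser) and assumes the two-space indentation used in the manifests in this description.

```shell
# Hypothetical helper: exits 0 when the manifest on stdin sets
# spec.dynamicAllocation.enabled to true, and 1 otherwise.
# Assumes two-space indentation; this is a text scan, not a YAML parser.
has_dynamic_allocation() {
  awk '
    /^  dynamicAllocation:[ \t]*$/ { in_block = 1; next }
    in_block && /^    enabled:[ \t]*true[ \t]*$/ { found = 1 }
    in_block && !/^    / { in_block = 0 }   # a shallower line ends the block
    END { exit (found ? 0 : 1) }
  '
}

manifest='apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: sample-sparkapp
spec:
  dynamicAllocation:
    enabled: true'

if printf '%s\n' "$manifest" | has_dynamic_allocation; then
  echo "dynamicAllocation is enabled; this SparkApplication would be rejected"
else
  echo "dynamicAllocation is not enabled"
fi
```

The cluster-side check remains the source of truth; this only saves a round trip for the obvious case.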

Which issue(s) this PR fixes:

Special notes for your reviewer:

This PR still depends on the unreleased main branch of spark-operator. We NEED to replace that dependency with a release tag once spark-operator cuts its next release (expected next week).

How to try manually

Setup

Build kueue image first

# linux/arm64 matches Apple silicon Macs; set PLATFORMS to your host architecture otherwise.
$ make image-build PLATFORMS=linux/arm64

Kind cluster with Kueue and dependent controllers:

$ echo "n" | make test-e2e E2E_RUN_ONLY_ENV=true KIND_CLUSTER_NAME=sparkapp-integration \
  && docker pull spark:4.0.0 && kind load docker-image --name sparkapp-integration spark:4.0.0
...
Skipping cleanup for kind cluster.

Kind cluster cleanup:
 kind delete cluster --name sparkapp-integration
4.0.0: Pulling from library/spark
...
Image: "spark:4.0.0" with ID "sha256:94f4e45d53db87ce2439b627f02d514ce235c563b17a821f05357ea52180dd2f" not yet present on node "sparkapp-integration-worker2", loading...

Common resources (ResourceFlavor, ClusterQueue, LocalQueue, and a ServiceAccount and RoleBinding for Spark):

$ kubectl --context=kind-sparkapp-integration apply -f - <<EOT
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: default
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}
  resourceGroups:
    - coveredResources:
        - cpu
        - memory
      flavors:
        - name: default
          resources:
            - name: cpu
              nominalQuota: 4
            - name: memory
              nominalQuota: 4Gi
  preemption:
    reclaimWithinCohort: LowerPriority
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  name: queue
  namespace: default
spec:
  clusterQueue: cluster-queue
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-operator-spark-edit
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: default
EOT
Try
# This SparkApplication spawns 1 driver pod,
# and the driver pod spawns 2 executor pods.
$ kubectl --context=kind-sparkapp-integration apply -f - <<EOT
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: queue
spec:
  type: Scala
  mode: cluster
  image: spark:4.0.0
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples.jar
  arguments:
  - "50000"
  sparkVersion: 4.0.0
  memoryOverheadFactor: "0"     # Spark adds extra memory to the memory limit
                                # for non-JVM overhead; "0" is meant to avoid that.
  driver:
    coreRequest: "1"
    memory: 1g                  # In Java format (e.g. 512m, 2g)
    serviceAccount: spark
  executor:
    instances: 2
    coreRequest: "1"
    memory: 1g                  # In Java format (e.g. 512m, 2g)
    deleteOnTermination: false  # to keep terminated executor pods for demo purpose
    serviceAccount: spark
EOT

$ watch -d -n 1 kubectl --context=kind-sparkapp-integration get workloads.kueue.x-k8s.io,sparkapp,pods -o wide
...
... You can see:
... - The SparkApplication is created and its Workload is admitted
... - The driver pod is created, and then 2 executor pods are created
... - After 10-20 seconds, the SparkApplication is marked 'COMPLETED' and its Workload is marked as finished
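
To see why this example is admitted, the requests can be summed against the ClusterQueue quota. This is a rough sketch with stated assumptions: that the Workload counts one driver and two executor pods, that `memoryOverheadFactor: "0"` leaves each pod's memory request at exactly 1g as the manifest comment intends, and that `java_mem_to_mib` is a hypothetical helper handling only the `m` and `g` suffixes.

```shell
# Convert a Java-style memory string (e.g. 512m, 2g) to MiB.
# Hypothetical helper for illustration; handles only the m and g suffixes.
java_mem_to_mib() {
  case "$1" in
    *g) echo $(( ${1%g} * 1024 )) ;;
    *m) echo "${1%m}" ;;
    *)  echo "unsupported memory value: $1" >&2; return 1 ;;
  esac
}

executors=2                                   # executor.instances
pods=$(( 1 + executors ))                     # one driver plus the executors
total_cpu=$(( pods * 1 ))                     # coreRequest: "1" per pod
total_mem_mib=$(( pods * $(java_mem_to_mib 1g) ))   # memory: 1g per pod (assumes no overhead)

quota_cpu=4                                   # nominalQuota for cpu
quota_mem_mib=$(( 4 * 1024 ))                 # nominalQuota: 4Gi

echo "cpu: ${total_cpu}/${quota_cpu}, memory: ${total_mem_mib}Mi/${quota_mem_mib}Mi"
if [ "$total_cpu" -le "$quota_cpu" ] && [ "$total_mem_mib" -le "$quota_mem_mib" ]; then
  echo "fits within the ClusterQueue quota"
else
  echo "exceeds the ClusterQueue quota"
fi
```

Under these assumptions the example needs 3 CPUs and 3072Mi, which leaves room in the 4 CPU / 4Gi quota. If Spark adds memory overhead in your setup, the memory total will be higher than this estimate.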
Cleanup
kind delete cluster --name sparkapp-integration

Does this PR introduce a user-facing change?

Support kubeflow/spark-operator's SparkApplication integration, without elastic job support (i.e., Dynamic Allocation in SparkApplication is not supported).

@k8s-ci-robot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 15, 2025
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: everpeace
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Oct 15, 2025
@netlify

netlify bot commented Oct 15, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit bf4677a
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/690fe95e56d7c2000891300e

@everpeace everpeace changed the title WIP: Kubeflow SprkApplication integration WIP: Kubeflow SparkApplication integration Oct 15, 2025
@everpeace everpeace changed the title WIP: Kubeflow SparkApplication integration Kubeflow SparkApplication integration Oct 15, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 15, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 25, 2025
@everpeace everpeace force-pushed the sprk-application branch 2 times, most recently from 4000068 to 5082895 Compare November 7, 2025 23:41
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 7, 2025
@everpeace

/test all

@k8s-ci-robot

@everpeace: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kueue-verify-main bf4677a link true /test pull-kueue-verify-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

