[GH-2609] Support Spark 4.1 #2649

Draft
jiayuasu wants to merge 11 commits into master from support-spark-4.1

Conversation


@jiayuasu (Member) commented Feb 12, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

Yes, it closes GH-2609 (Support Spark 4.1).

What changes were proposed in this PR?

This PR adds support for Apache Spark 4.1 in Sedona.

Build scaffolding

  • Added sedona-spark-4.1 Maven profile in root pom.xml (Spark 4.1.1, Scala 2.13.17, Hadoop 3.4.1)
  • Added spark-4.1 module entry in spark/pom.xml (enable-all-submodules profile)
  • Added sedona-spark-4.1 profile in spark/common/pom.xml with spark-sql-api dependency
  • Created spark/spark-4.1/ module (copied from spark/spark-4.0/, updated artifactId)
  • Fixed Scala version mismatch: updated scala2.13 and sedona-spark-4.0 profiles to use Scala 2.13.17

Spark 4.1 API compatibility fixes

  • ParquetColumnVector.java: Replaced the setAllNull() call with a reflection-based markAllNull() helper that works with both setAllNull (Spark < 4.1) and setMissing (Spark 4.1+); see the sketch after this list
  • Functions.scala: Added explicit org.locationtech.jts.geom.Geometry import to resolve ambiguity with new org.apache.spark.sql.functions.Geometry in Spark 4.1
  • SedonaArrowEvalPythonExec.scala (spark-4.1 only): Added sessionUUID parameter required by Spark 4.1 ArrowEvalPythonExec
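
Roughly how the reflection helper works, as a minimal Scala sketch (the real fix is written in Java inside ParquetColumnVector.java; only the setAllNull/setMissing method names and the markAllNull helper name come from this PR):

```scala
import org.apache.spark.sql.execution.vectorized.WritableColumnVector

object ColumnVectorCompat {
  // Resolve whichever method this Spark version provides, once per class load:
  // setAllNull on Spark < 4.1, setMissing on Spark 4.1+.
  private val markAllNullMethod =
    try classOf[WritableColumnVector].getMethod("setAllNull")
    catch {
      case _: NoSuchMethodException =>
        classOf[WritableColumnVector].getMethod("setMissing")
    }

  // Version-agnostic replacement for vector.setAllNull().
  def markAllNull(vector: WritableColumnVector): Unit = {
    markAllNullMethod.invoke(vector)
    ()
  }
}
```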

SPARK-52671 UDT workaround

Spark 4.1 changed RowEncoder.encoderForDataType to call udt.getClass directly instead of looking up via UDTRegistration. For Scala case object UDTs, getClass returns the module class (e.g., GeometryUDT$) which has a private constructor, causing ScalaReflectionException.

Fix: Added apply() factory methods to all three UDT case objects (GeometryUDT, GeographyUDT, RasterUDT) and replaced bare singleton references with UDT() calls across source files so that schema construction uses proper class instances.
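
A simplified Scala sketch of the pattern, with WKB serialization as a stand-in (Sedona's real GeometryUDT has its own serializer; only the class/case-object split and the apply() factory come from this PR):

```scala
import org.apache.spark.sql.types.{BinaryType, DataType, UserDefinedType}
import org.locationtech.jts.geom.Geometry
import org.locationtech.jts.io.{WKBReader, WKBWriter}

// Plain public class: `new GeometryUDT().getClass` yields GeometryUDT,
// which has an accessible no-arg constructor that Spark 4.1's RowEncoder
// can instantiate.
class GeometryUDT extends UserDefinedType[Geometry] {
  override def sqlType: DataType = BinaryType
  override def serialize(obj: Geometry): Array[Byte] = new WKBWriter().write(obj)
  override def deserialize(datum: Any): Geometry =
    new WKBReader().read(datum.asInstanceOf[Array[Byte]])
  override def userClass: Class[Geometry] = classOf[Geometry]
}

// Singleton kept for source compatibility; its getClass is the module
// class GeometryUDT$, whose constructor is private.
case object GeometryUDT extends GeometryUDT {
  // Factory used by schema-construction code instead of the bare singleton.
  def apply(): GeometryUDT = new GeometryUDT
}
```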

Python support

  • Bumped pyspark upper bound from <4.1.0 to <4.2.0 in python/pyproject.toml (3 locations)

CI workflows

  • java.yml: Added Spark 4.1.1 + Scala 2.13.17 + JDK 17 matrix entries (both compile and unit-test jobs)
  • python.yml: Added Spark 4.1.1 matrix entry
  • example.yml: Added Spark 4.1 matrix entry

Documentation

  • Updated docs/setup/maven-coordinates.md with Spark 4.1 artifact coordinates
  • Updated docs/setup/platform.md compatibility table (Spark 4.1 requires Scala 2.13 and Python 3.10+)
  • Updated docs/community/publish.md release checklist

How was this PR tested?

  • All 6 Spark/Scala/JDK combinations compile successfully:
    • Spark 3.4 + Scala 2.12 + JDK 11
    • Spark 3.4 + Scala 2.13 + JDK 11
    • Spark 3.5 + Scala 2.12 + JDK 11
    • Spark 3.5 + Scala 2.13 + JDK 11
    • Spark 4.0 + Scala 2.13 + JDK 17
    • Spark 4.1 + Scala 2.13 + JDK 17
  • Unit tests for Spark 4.1: 1610 passed, 44 failed; all failures are GEOSPATIAL_DISABLED errors because Spark 4.1's built-in geospatial functions shadow Sedona's (tracked as a separate issue; see the repro sketch below)
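
A hedged repro sketch of the shadowing (SedonaContext is Sedona's standard entry point; the behavior of Spark 4.1's built-in ST_Point is as reported by the failing tests, not verified here):

```scala
import org.apache.sedona.spark.SedonaContext

val spark = SedonaContext.create(
  SedonaContext.builder().master("local[*]").getOrCreate())

// On Spark <= 4.0 this resolves to Sedona's ST_Point. On Spark 4.1 a
// built-in ST_Point shadows it and the query fails with GEOSPATIAL_DISABLED.
spark.sql("SELECT ST_Point(1.0, 2.0)").show()
```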

Key files changed

  • Build: pom.xml, spark/pom.xml, spark/common/pom.xml, spark/spark-4.1/pom.xml
  • Spark 4.1 module: spark/spark-4.1/src/ (all files)
  • UDT workaround: GeometryUDT.scala, GeographyUDT.scala, RasterUDT.scala, plus ~50 files with schema changes
  • API compat: ParquetColumnVector.java, Functions.scala
  • Python: python/pyproject.toml
  • CI: .github/workflows/java.yml, python.yml, example.yml
  • Docs: docs/setup/maven-coordinates.md, docs/setup/platform.md, docs/community/publish.md

- Add sedona-spark-4.1 Maven profile (Spark 4.1.0, Scala 2.13.17, Hadoop 3.4.1)
- Create spark/spark-4.1 module based on spark-4.0
- Fix Geometry import ambiguity (Spark 4.1 adds o.a.s.sql.types.Geometry); see the sketch after this list
- Fix WritableColumnVector.setAllNull() removal (replaced by setMissing() in 4.1)
- Add sessionUUID parameter to ArrowPythonWithNamedArgumentRunner (new in 4.1)
- Update docs (maven-coordinates, platform, publish)
- Update CI workflows (java, example, python, docker-build)
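
A minimal sketch of the import fix (the function body is illustrative; only the explicit JTS import and the clash with a new Spark Geometry name come from this PR):

```scala
// Spark 4.1 introduces its own Geometry name, so the JTS type is imported
// explicitly instead of being picked up through wildcard imports.
import org.locationtech.jts.geom.Geometry

object FunctionsSketch {
  // With the explicit import, `Geometry` resolves unambiguously to JTS.
  def centroid(geom: Geometry): Geometry = geom.getCentroid
}
```
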
Spark 4.1's RowEncoder calls udt.getClass directly, which returns the
Scala module class (e.g. GeometryUDT$) with a private constructor for
case objects, causing EXPRESSION_DECODING_FAILED errors.

Fix: Add apply() method to GeometryUDT, GeographyUDT, and RasterUDT
case objects that return new class instances, and use UDT() instead of
the bare singleton throughout schema construction code. This ensures
getClass returns the public class with an accessible constructor.
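
An illustrative before/after for the schema-construction change, reusing the GeometryUDT sketch above (the field name is hypothetical):

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Before: fails on Spark 4.1 because udt.getClass is the module class
// GeometryUDT$ with a private constructor.
// val schema = StructType(Seq(StructField("geom", GeometryUDT)))

// After: apply() returns a plain class instance with a public constructor.
val schema = StructType(Seq(StructField("geom", GeometryUDT())))
```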

Also:
- Revert docker-build.yml (no Spark 4.1 in Docker builds)
- Bump pyspark upper bound from <4.1.0 to <4.2.0
- Bump Spark 4.1.0 to 4.1.1 in CI and POM
- Fix Scala 2.13.12 vs 2.13.17 mismatch in scala2.13 profile