Add Spark 4.0 support via deequ:2.0.14-spark-4.0 #259

sudsali merged 7 commits into awslabs:release/1.0.0-spark-4.0
Conversation
This is now ready for review; CI tests pass on my fork: https://github.com/m-aciek/python-deequ/actions/runs/24196839467
chenliu0831 left a comment
LGTM. I'm not sure if we would like to keep maintaining the Py4j approach though.
@m-aciek we need your commit to have verified signatures.
- Add "4.0" entry to SPARK_TO_DEEQU_COORD_MAPPING in configs.py - Widen pyspark optional dep bound to <5.0.0 in pyproject.toml - Replace scala.collection.JavaConversions (removed in Scala 2.13) with JavaConverters in scala_utils.py and profiles.py - Replace scala.collection.Seq.empty() (inaccessible via Py4J in Scala 2.13) with to_scala_seq(jvm, jvm.java.util.ArrayList()) in analyzers.py and checks.py - Add Spark 4.0.0 to CI matrix with Java 17; use include: style to pair each Spark version with its required Java version - Fix CI for Spark 4.0: - use Python 3.9 and version-marker pyspark dep - use pip install instead of poetry add - install pandas>=2.0.0 required by PySpark 4.0 - Fix empty Seq compatibility across Scala 2.12 and 2.13 Fixes awslabs#258
force-pushed from d80a2cd to 6aa90c3
@chenliu0831 Thank you for reviewing! I've set up the verification and squashed the commits.

The PR should be on a new branch, following the deequ release pattern, to make sure we don't force everyone to use Spark 4.0.

@sudsali Would you be able to create a new branch that I could use as the base for this PR?
```yaml
fail-fast: false
matrix:
  include:
    - PYSPARK_VERSION: "3.1.3"
```
Unresolved merge conflict markers (<<<<<<< spark-4-support, =======, >>>>>>> master) are present throughout this file. This will cause a YAML parse error and CI will not run. The conflict markers must be resolved before merging.
```diff
         Scala sequence
     """
-    return jvm.scala.collection.JavaConversions.iterableAsScalaIterable(iterable).toSeq()
+    return jvm.scala.collection.JavaConverters.iterableAsScalaIterableConverter(iterable).asScala().toSeq()
```
JavaConverters is itself deprecated since Scala 2.13 in favor of scala.jdk.CollectionConverters. While JavaConverters still exists in Scala 2.13, it may be removed in a future Scala version. Consider using scala.jdk.CollectionConverters for Spark 4+ (Scala 2.13) and keeping JavaConverters as a fallback for Spark 3.x (Scala 2.12), or at minimum document this as a known future risk.
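A rough sketch of the version-aware selection this comment suggests; the name `_get_converters` matches the helper added later in the PR, but the detection below (reading the Scala version string) is only one possible approach and an assumption, not the code that landed:

```python
def _get_converters(jvm):
    """Pick a collection-converter API matching the JVM's Scala version.

    scala.util.Properties.versionNumberString() returns e.g. "2.13.8";
    Scala's static forwarders should make it callable via Py4J (assumed).
    """
    scala_version = jvm.scala.util.Properties.versionNumberString()
    if scala_version.startswith("2.13"):
        # Java-friendly API introduced in Scala 2.13 (Spark 4+)
        return "jdk", jvm.scala.jdk.javaapi.CollectionConverters
    # Deprecated in 2.13 but the idiomatic choice on Scala 2.12 (Spark 3.x)
    return "legacy", jvm.scala.collection.JavaConverters
```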
```python
    return jvm.scala.collection.JavaConverters.iterableAsScalaIterableConverter(
        jvm.java.util.ArrayList()
    ).asScala().toList()
```
The docstring says "Uses JavaConverters.toList()" but the code actually calls .asScala().toList() on the result of iterableAsScalaIterableConverter. The docstring is misleading about which API is being used.
| """ | ||
| self._run_result = run | ||
| profile_map = self._jvm.scala.collection.JavaConversions.mapAsJavaMap(run.profiles()) # TODO from ScalaUtils | ||
| profile_map = self._jvm.scala.collection.JavaConverters.mapAsJavaMapConverter(run.profiles()).asJava() # TODO from ScalaUtils |
This is a direct inline call to JavaConverters.mapAsJavaMapConverter(...).asJava() instead of using the existing scala_map_to_java_map helper from scala_utils.py (which was updated in this same PR). Use scala_map_to_java_map(self._jvm, run.profiles()) for consistency and to avoid duplicating the conversion logic.
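The suggested change, sketched under the assumption that `scala_map_to_java_map(jvm, scala_map)` is the signature of the updated helper in `scala_utils.py`:

```python
# Suggested replacement in profiles.py (fragment; `self` is the surrounding
# profile-runner instance and `run` the Scala run result, as in the diff above):
from pydeequ.scala_utils import scala_map_to_java_map

profile_map = scala_map_to_java_map(self._jvm, run.profiles())
```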
@m-aciek Please address all the comments.
… (Spark 4+)

JavaConverters is deprecated in Scala 2.13 and may be removed in a future version. Detect the Scala version at runtime and prefer scala.jdk.javaapi.CollectionConverters (Spark 4+), falling back to scala.collection.JavaConverters (Spark 3.x). Also fix misleading docstring in empty_scala_seq.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…_to_java_map helper

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sudsali It's done now.
```diff
         Scala sequence
     """
-    return jvm.scala.collection.JavaConversions.iterableAsScalaIterable(iterable).toSeq()
+    style, converters = _get_converters(jvm)
```
For Scala 2.12 (Spark 3.x), JavaConverters uses iterableAsScalaIterableConverter(...).asScala() pattern, but the original code used JavaConversions.iterableAsScalaIterable(...) which returns the Scala iterable directly. The JavaConverters equivalent should work, but note that JavaConversions was deprecated in 2.12 and removed in 2.13, while JavaConverters was deprecated in 2.13. This legacy branch is fine for 2.12 but the comment in _get_converters says scala.collection.JavaConverters (Scala 2.12, Spark 3.x) — this is accurate, just noting for completeness.
```diff
@@ -93,11 +128,17 @@ def to_scala_map(spark_session, d):
```
to_scala_map still uses spark_session._jvm.PythonUtils.toScalaMap(d) which relies on PythonUtils — a PySpark internal helper. In Spark 4.0, PythonUtils may have moved or changed. If this breaks on Spark 4, it should be updated to use the same _get_converters pattern. Was this path tested with Spark 4.0?
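A hedged sketch of what such a fallback could look like, reusing the `_get_converters` helper from this PR; the exception types and the exact behavior of a removed PythonUtils are assumptions:

```python
def to_scala_map(spark_session, d):
    """Convert a Python dict to a Scala Map, with a non-internal fallback."""
    jvm = spark_session._jvm
    try:
        # Fast path: PySpark-internal helper, present through Spark 3.x/4.0
        return jvm.PythonUtils.toScalaMap(d)
    except (TypeError, AttributeError):
        # If PythonUtils ever disappears, Py4J yields an unresolvable
        # JavaPackage and the call raises TypeError; rebuild the map by hand.
        java_map = jvm.java.util.HashMap()
        for key, value in d.items():
            java_map.put(key, value)  # relies on Py4J auto-conversion
        style, converters = _get_converters(jvm)
        if style == "jdk":
            return converters.asScala(java_map)  # scala.jdk.javaapi API
        return converters.mapAsScalaMapConverter(java_map).asScala()
```

Note the fallback yields a mutable Scala Map rather than the immutable one PythonUtils produces; whether deequ's entry points accept that is untested here.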
| - PYSPARK_VERSION: "3.5" | ||
| PYTHON_VERSION: "3.9" | ||
| JAVA_VERSION: "17" | ||
| - PYSPARK_VERSION: "4.0.0" |
The matrix entry uses PYSPARK_VERSION: "4.0.0" (three-part version) while all other entries use two-part versions like "3.5". In the test step, pip install pyspark==$SPARK_VERSION will work either way, but _extract_major_minor_versions in configs.py will extract "4.0" from "4.0.0", so the SPARK_VERSION env var works. However, this inconsistency is confusing — consider using "4.0" to match the other entries, or document why the full version is needed.
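For reference, a sketch of the major.minor extraction this comment refers to; the real `_extract_major_minor_versions` in `configs.py` may differ in detail:

```python
import re

def _extract_major_minor_versions(full_version: str) -> str:
    """Reduce a version string like "4.0.0" (or "3.5") to "major.minor"."""
    match = re.search(r"^(\d+)\.(\d+)", full_version)
    if match is None:
        raise ValueError(f"Unparseable Spark version: {full_version!r}")
    return f"{match.group(1)}.{match.group(2)}"

assert _extract_major_minor_versions("4.0.0") == "4.0"
assert _extract_major_minor_versions("3.5") == "3.5"
```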
| - PYSPARK_VERSION: "4.0.0" | ||
| PYTHON_VERSION: "3.9" | ||
| JAVA_VERSION: "17" | ||
| PANDAS_VERSION: ">=2.0.0" |
PANDAS_VERSION is only defined for the Spark 4.0 matrix entry. Other matrix entries won't have it set, so the if [ -n "$PANDAS_VERSION" ] guard in the run step handles that. However, the PANDAS_VERSION value is ">=2.0.0" which is a version specifier, not a version — the env var name is misleading. Consider renaming to PANDAS_VERSION_SPEC or similar for clarity.
```sh
pip install --upgrade pip
pip install poetry==1.7.1
poetry install
poetry run pip install pyspark==$SPARK_VERSION
```
The poetry run pip install pyspark==$SPARK_VERSION line uses SPARK_VERSION which for the 4.0 entry is "4.0.0". This works for pip (pyspark==4.0.0), but for Spark 3.x entries like "3.5", pip will install the latest 3.5.x patch. This is existing behavior but worth noting the inconsistency with the new entry.
…me PANDAS_VERSION_SPEC

- to_scala_map: add CollectionConverters fallback in case PythonUtils is removed in a future Spark/PySpark version
- base.yml: use "4.0" (two-part) for PYSPARK_VERSION to match other matrix entries
- base.yml: rename PANDAS_VERSION → PANDAS_VERSION_SPEC to reflect that the value is a version specifier (e.g. ">=2.0.0"), not a bare version number

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…orrectly

On Scala 2.12, jvm.scala.jdk.javaapi.CollectionConverters resolves to a JavaPackage placeholder rather than raising an exception, so the previous try/except always selected the jdk path and broke Spark 3.x. Verify usability by actually calling asScala() as a one-time probe.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
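A sketch of the probe this commit describes, combined with the per-JVM caching mentioned in the summary below; keying `lru_cache` on the Py4J JVM view is an assumed implementation detail:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # probe once per JVM view (assumed caching strategy)
def _get_converters(jvm):
    candidate = jvm.scala.jdk.javaapi.CollectionConverters
    try:
        # On Scala 2.12 `candidate` is an unresolvable JavaPackage, and
        # *calling* it raises TypeError; merely touching the attribute does
        # not fail, which is why a real invocation is used as the probe.
        candidate.asScala(jvm.java.util.ArrayList())
        return "jdk", candidate
    except TypeError:
        return "legacy", jvm.scala.collection.JavaConverters
```

Probing with a real call sidesteps Py4J's lazy package resolution, which is exactly the failure mode the commit message describes.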
Closes #258
Summary
"4.0": "com.amazon.deequ:deequ:2.0.14-spark-4.0"toSPARK_TO_DEEQU_COORD_MAPPINGinconfigs.py>=2.4.7,<3.4.0to>=2.4.7,<5.0.0inpyproject.tomlinclude:style so each Spark version carries its required Java version; normalize Spark 4 version string to"4.0"and renamePANDAS_VERSIONenv var toPANDAS_VERSION_SPECfor clarityscala.collection.JavaConversions(removed in Scala 2.13) with a version-aware converter layer inscala_utils.py: prefersscala.jdk.javaapi.CollectionConverters(Spark 4+ / Scala 2.13) and falls back toscala.collection.JavaConverters(Spark 3.x / Scala 2.12); detection is cached per JVM instancescala.collection.Seq.empty()(inaccessible via Py4J in Scala 2.13) with an empty Java list converted viato_scala_seqinanalyzers.pyandchecks.pyJavaConverterscall inprofiles.pywith the existingscala_map_to_java_maphelperPythonUtils.toScalaMapfallback into_scala_map: if the PySpark internal is ever removed in a future version, the function falls back toCollectionConverters/JavaConvertersRoot causes fixed
Spark 4 uses Scala 2.13, which introduced breaking changes affecting pydeequ:
- `scala.collection.JavaConversions` was removed; replaced by a version-aware layer using `scala.jdk.javaapi.CollectionConverters` (Scala 2.13) with a `scala.collection.JavaConverters` fallback (Scala 2.12)
- `scala.collection.Seq.empty()` is not accessible via Py4J reflection; replaced with `to_scala_seq(jvm, jvm.java.util.ArrayList())`

Detection note: on Scala 2.12, `jvm.scala.jdk.javaapi.CollectionConverters` does not raise an exception; Py4J resolves it to a `JavaPackage` placeholder. Version detection therefore probes by actually calling `asScala()` on an empty list; a `TypeError` from the unresolvable package triggers the fallback.

Test plan
- `pyspark==4.0` / Java 17
- `pyspark==3.5` / Java 17 (and earlier 3.x versions)

PR authored with assistance from Claude Code