Skip to content

Conversation

@jiayuasu
Copy link
Member

@jiayuasu jiayuasu commented Feb 10, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

This PR enables libpostal model data (spark.sedona.libpostal.dataDir) to be loaded from remote filesystems such as HDFS, S3 (s3a://), GCS (gs://), and ABFS (abfs://), in addition to local paths.

Problem

jpostal requires the ~2 GB libpostal model data to reside on the local filesystem. In cloud deployments, users must manually pre-install the data on every executor node, which is operationally cumbersome.

Solution

Leverages Spark's native file distribution mechanism (SparkContext.addFile / SparkFiles.get) to distribute remote data to executors. Users call sc.addFile(remotePath, recursive = true) once before running queries, and Sedona resolves the data locally via SparkFiles.get(basename).

  • LibPostalDataLoader: New object that resolves remote paths via SparkFiles.get(). For local paths and file:// URIs, values are passed through unchanged (with file: URIs normalized to plain paths). For remote URIs, the basename is extracted and looked up through SparkFiles. Uses case-insensitive scheme comparison for file: URIs.
  • LibPostalUtils: Modified to call LibPostalDataLoader.resolveDataDir() before jpostal init, and to disable jpostal's built-in download when using a remote path.
  • Clear error messages: If remote data has not been distributed via sc.addFile, an IllegalStateException is thrown with instructions to call sc.addFile(remotePath, recursive = true).

Files changed

File Change
spark/common/.../expressions/LibPostalDataLoader.scala New — remote path resolution via SparkFiles, local path normalization
spark/common/.../expressions/LibPostalUtils.scala Modified — call LibPostalDataLoader.resolveDataDir() before jpostal init
docs/api/sql/Function.md Modified — document remote URI support, default data directory, data preparation instructions, and sc.addFile usage for ExpandAddress and ParseAddress
spark/common/.../sql/LibPostalDataLoaderTest.scala New — 15 tests for path detection and SparkFiles resolution

How was this patch tested?

  • 15 unit tests covering:
    • isRemotePath — 11 URI schemes (local, relative, file://, hdfs:// with/without host, s3a://, gs://, abfs://, wasb://, empty, Windows)
    • resolveDataDir — local path unchanged, file:// URI normalization
    • SparkFiles error — IllegalStateException with clear sc.addFile(..., recursive = true) instructions when data not found
    • SparkFiles resolution — places mock libpostal data (with subdirectories) directly in the SparkFiles root directory, verifies resolveDataDir resolves a remote URI to the correct local path with all subdirectories intact
  • All 15 tests pass: mvn test -pl spark/common -Dspark=3.5 -Dscala=2.12 -Dtest=none -DfailIfNoTests=false -DwildcardSuites="org.apache.sedona.sql.LibPostalDataLoaderTest"

Did this PR include necessary documentation updates?

  • Yes. Updated docs/api/sql/Function.md for both ExpandAddress and ParseAddress to document the default data directory (<java.io.tmpdir>/libpostal/), spark.sedona.libpostal.dataDir configuration, data preparation instructions (libpostal install + upload to HDFS/S3), and instructions for using sc.addFile with remote paths.

@jiayuasu jiayuasu force-pushed the feature/2360-libpostal-remote-datadir branch from 7d39f23 to 18190fd Compare February 10, 2026 09:27
@jiayuasu jiayuasu added this to the sedona-1.9.0 milestone Feb 10, 2026
@jiayuasu jiayuasu requested a review from Copilot February 10, 2026 09:28
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds support for resolving libpostal model data directories from remote Hadoop-compatible filesystems by copying them into a per-node local cache before initializing jpostal.

Changes:

  • Introduces HadoopFileSystemUtils to copy files/directories from remote FS to local, and reuses it from GeoPackage utilities.
  • Adds LibPostalDataLoader to detect remote URIs, download to a hashed local cache directory, and guard concurrent downloads with per-key locks.
  • Updates libpostal initialization and documentation; adds unit tests using MiniDFSCluster.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
spark/common/src/main/scala/org/apache/sedona/sql/utils/HadoopFileSystemUtils.scala New shared Hadoop FS → local copy helpers for files and directory trees.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/LibPostalDataLoader.scala New remote libpostal dataDir resolver with local caching and concurrency control.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/LibPostalUtils.scala Resolves dataDir via loader and disables jpostal download for remote URIs.
spark/common/src/main/scala/org/apache/sedona/sql/datasources/geopackage/connection/FileSystemUtils.scala Refactors GeoPackage copy utility to delegate to shared Hadoop FS helper.
spark/common/src/test/scala/org/apache/sedona/sql/HadoopFileSystemUtilsTest.scala New tests for local/remote detection and HDFS copy behaviors.
spark/common/src/test/scala/org/apache/sedona/sql/LibPostalDataLoaderTest.scala New tests for remote-path detection, caching behavior, and concurrent access.
docs/api/sql/Function.md Documents remote URI support for libpostal dataDir configuration.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jiayuasu jiayuasu force-pushed the feature/2360-libpostal-remote-datadir branch from 18190fd to 8464efc Compare February 10, 2026 09:37
@jiayuasu jiayuasu requested a review from Copilot February 10, 2026 09:39
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jiayuasu jiayuasu marked this pull request as draft February 10, 2026 09:43
@jiayuasu jiayuasu force-pushed the feature/2360-libpostal-remote-datadir branch 2 times, most recently from bea9f8c to 6f7d89d Compare February 10, 2026 10:09
@jiayuasu jiayuasu requested a review from Copilot February 10, 2026 10:10
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jiayuasu jiayuasu force-pushed the feature/2360-libpostal-remote-datadir branch 2 times, most recently from e2059a2 to c7b509a Compare February 11, 2026 07:45
@jiayuasu jiayuasu requested a review from Copilot February 11, 2026 07:48
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jiayuasu jiayuasu force-pushed the feature/2360-libpostal-remote-datadir branch from c7b509a to fa29ea9 Compare February 11, 2026 07:55
@jiayuasu jiayuasu marked this pull request as ready for review February 11, 2026 07:57
@jiayuasu jiayuasu force-pushed the feature/2360-libpostal-remote-datadir branch from fa29ea9 to 74fe0f0 Compare February 11, 2026 08:14
@jiayuasu jiayuasu changed the title [SEDONA-2360] Support fetching libpostal model data from HDFS/object store [GH-2360] Support fetching libpostal model data from HDFS/object store Feb 11, 2026
@jiayuasu jiayuasu force-pushed the feature/2360-libpostal-remote-datadir branch from 74fe0f0 to 41c1ac9 Compare February 11, 2026 08:53
@jiayuasu jiayuasu requested a review from Copilot February 11, 2026 09:25
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jiayuasu jiayuasu force-pushed the feature/2360-libpostal-remote-datadir branch from 41c1ac9 to 16272c2 Compare February 11, 2026 09:55
@jiayuasu jiayuasu marked this pull request as draft February 11, 2026 10:23
@jiayuasu jiayuasu force-pushed the feature/2360-libpostal-remote-datadir branch from 16272c2 to b9a8d2c Compare February 11, 2026 11:00
…store

- Add shared HadoopFileSystemUtils with copyFileToLocal and copyDirectoryToLocal
- Refactor GeoPackage FileSystemUtils to delegate to shared utility
- Add LibPostalDataLoader to resolve remote dataDir (HDFS, S3, GCS, ABFS) to local cache
- Update LibPostalUtils to use LibPostalDataLoader for remote path resolution
- Disable jpostal auto-download when data is fetched from remote store
- Add tests for HadoopFileSystemUtils and LibPostalDataLoader using MiniDFSCluster
- Update docs for ExpandAddress and ParseAddress with remote dataDir instructions
@jiayuasu jiayuasu force-pushed the feature/2360-libpostal-remote-datadir branch from b9a8d2c to 2a55427 Compare February 11, 2026 18:33
## ExpandAddress

Introduction: Returns an array of expanded forms of the input address string. This is backed by the [libpostal](https://github.com/openvenues/libpostal) library's address expanding functionality.
Introduction: Returns an array of expanded forms of the input address string. This is backed by the [libpostal](https://github.com/openvenues/libpostal) library's address expanding functionality. Jpostal requires at least 2 GB of free disk space to store the data files used for address parsing and expanding. By default, the data files are downloaded automatically to a temporary directory (`<java.io.tmpdir>/libpostal/`, e.g. `/tmp/libpostal/` on Linux/macOS) when the function is called for the first time. The version of jpostal installed with this package only supports Linux and macOS. If you are using Windows, you will need to install libjpostal and libpostal manually and ensure that they are available in your `java.library.path`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should still be in a note, not the intro

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i intendtionally removed it since there are too many notes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but now I bring it back

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe they dont need to be notes but they shouldnt be part of the intro


!!!Note
Jpostal requires at least 2 GB of free disk space to store the data files used for address parsing and expanding. The data files are downloaded automatically when the function is called for the first time.
The data directory can be configured via `spark.sedona.libpostal.dataDir`. You can point it to a remote filesystem path (HDFS, S3, GCS, ABFS, etc.) such as `hdfs:///data/libpostal/` or `s3a://my-bucket/libpostal/`. When using a remote path, you must distribute the data to all executors before running queries by calling `sc.addFile("hdfs:///data/libpostal/", recursive=True)` (PySpark) or `sc.addFile("hdfs:///data/libpostal/", recursive = true)` (Scala). In this remote-URI mode, the automatic internet download performed by jpostal is disabled, so the remote directory must already contain the libpostal model files. For local filesystem paths, jpostal's download-if-needed behavior remains enabled.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why wouldnt this be automatically distirbuted as part of the implementation here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sc.addFile needs to be called from the driver before tasks run on executors
It requires access to SparkContext, which isn't readily available from within a UDF/expression

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we handle this when we initialize the spark session with sedona?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to handle that because no everybody needs libpostal so it has to be optional.

But if it requires user to opt in, then it is no different than a sc.addFile call?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. let me think about this

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a two pronged approach would work:

  1. add an optmizer rule that performs the sc.addFile call when it sees a libpostal method in the plan
  2. in the execution itself we also try to perform a sc.addFile call to account for the cosntantFolding case.

Not totally sure if ive missed something

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated sc.addFile per query is dangerous and we might end up handling race condition. I think this over complicates the problem. I would rather not have this feature at all.


!!!Note
The version of jpostal installed with this package only supports Linux and MacOS. If you are using Windows, you will need to install libjpostal and libpostal manually and ensure that they are available in your `java.library.path`.
To prepare the libpostal data for a remote filesystem, first download it to a local machine by following the [libpostal installation instructions](https://github.com/openvenues/libpostal#installation-maclinux). After installation, the data files will be in the directory you specified during setup (commonly `/tmp/libpostal/`). Then upload them to your remote storage:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im not sure if "how to upload data to s3" is a neccessary part of our docs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines +98 to +99
val localPath = SparkFiles.get(basename)
val localFile = new File(localPath)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in contrast with the docs it seems that this code downloads the data from the hdfs itself.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@james-willis
Copy link
Collaborator

Looks good other than the docs.

@jiayuasu jiayuasu marked this pull request as ready for review February 11, 2026 21:42
@jiayuasu jiayuasu merged commit eed93cb into master Feb 12, 2026
48 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enhancements to libpostal integration: Fetch model from HDFS/object store

2 participants