[GH-2360] Support fetching libpostal model data from HDFS/object store #2637
Conversation
Force-pushed from 7d39f23 to 18190fd.
Pull request overview
Adds support for resolving libpostal model data directories from remote Hadoop-compatible filesystems by copying them into a per-node local cache before initializing jpostal.
Changes:
- Introduces `HadoopFileSystemUtils` to copy files/directories from a remote FS to local, and reuses it from the GeoPackage utilities.
- Adds `LibPostalDataLoader` to detect remote URIs, download to a hashed local cache directory, and guard concurrent downloads with per-key locks.
- Updates libpostal initialization and documentation; adds unit tests using `MiniDFSCluster`.
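For orientation, here is a minimal sketch of the kind of remote-to-local copy helper described above. The object and method names are illustrative assumptions, not the actual `HadoopFileSystemUtils` API added in this PR.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative sketch only: copy a remote file or directory tree to a local
// directory using the Hadoop FileSystem API. Names are hypothetical and do
// not match the HadoopFileSystemUtils introduced in this PR.
object RemoteCopySketch {
  def copyToLocal(remoteUri: String, localDir: String, hadoopConf: Configuration): Unit = {
    // Resolve the filesystem implementation for the remote scheme
    // (hdfs://, s3a://, gs://, abfs://, ...).
    val fs = FileSystem.get(new URI(remoteUri), hadoopConf)
    // copyToLocalFile handles both single files and whole directory trees;
    // delSrc = false keeps the remote copy, useRawLocalFileSystem = true
    // avoids writing .crc checksum side files locally.
    fs.copyToLocalFile(false, new Path(remoteUri), new Path(localDir), true)
  }
}
```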
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| spark/common/src/main/scala/org/apache/sedona/sql/utils/HadoopFileSystemUtils.scala | New shared Hadoop FS → local copy helpers for files and directory trees. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/LibPostalDataLoader.scala | New remote libpostal dataDir resolver with local caching and concurrency control. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/LibPostalUtils.scala | Resolves dataDir via loader and disables jpostal download for remote URIs. |
| spark/common/src/main/scala/org/apache/sedona/sql/datasources/geopackage/connection/FileSystemUtils.scala | Refactors GeoPackage copy utility to delegate to shared Hadoop FS helper. |
| spark/common/src/test/scala/org/apache/sedona/sql/HadoopFileSystemUtilsTest.scala | New tests for local/remote detection and HDFS copy behaviors. |
| spark/common/src/test/scala/org/apache/sedona/sql/LibPostalDataLoaderTest.scala | New tests for remote-path detection, caching behavior, and concurrent access. |
| docs/api/sql/Function.md | Documents remote URI support for libpostal dataDir configuration. |
Force-pushed from 18190fd to 8464efc.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
Force-pushed from bea9f8c to 6f7d89d.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Force-pushed from e2059a2 to c7b509a.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
Force-pushed from c7b509a to fa29ea9.
Force-pushed from fa29ea9 to 74fe0f0.
Force-pushed from 74fe0f0 to 41c1ac9.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
Force-pushed from 41c1ac9 to 16272c2.
Force-pushed from 16272c2 to b9a8d2c.
…store

- Add shared HadoopFileSystemUtils with copyFileToLocal and copyDirectoryToLocal
- Refactor GeoPackage FileSystemUtils to delegate to shared utility
- Add LibPostalDataLoader to resolve remote dataDir (HDFS, S3, GCS, ABFS) to local cache
- Update LibPostalUtils to use LibPostalDataLoader for remote path resolution
- Disable jpostal auto-download when data is fetched from remote store
- Add tests for HadoopFileSystemUtils and LibPostalDataLoader using MiniDFSCluster
- Update docs for ExpandAddress and ParseAddress with remote dataDir instructions
Force-pushed from b9a8d2c to 2a55427.
docs/api/sql/Function.md
Outdated
```diff
 ## ExpandAddress

-Introduction: Returns an array of expanded forms of the input address string. This is backed by the [libpostal](https://github.com/openvenues/libpostal) library's address expanding functionality.
+Introduction: Returns an array of expanded forms of the input address string. This is backed by the [libpostal](https://github.com/openvenues/libpostal) library's address expanding functionality. Jpostal requires at least 2 GB of free disk space to store the data files used for address parsing and expanding. By default, the data files are downloaded automatically to a temporary directory (`<java.io.tmpdir>/libpostal/`, e.g. `/tmp/libpostal/` on Linux/macOS) when the function is called for the first time. The version of jpostal installed with this package only supports Linux and macOS. If you are using Windows, you will need to install libjpostal and libpostal manually and ensure that they are available in your `java.library.path`.
```
This should still be in a note, not the intro
I intentionally removed it since there are too many notes.
But now I've brought it back.
Maybe they don't need to be notes, but they shouldn't be part of the intro.
docs/api/sql/Function.md
Outdated
```diff
 !!!Note
     Jpostal requires at least 2 GB of free disk space to store the data files used for address parsing and expanding. The data files are downloaded automatically when the function is called for the first time.
+    The data directory can be configured via `spark.sedona.libpostal.dataDir`. You can point it to a remote filesystem path (HDFS, S3, GCS, ABFS, etc.) such as `hdfs:///data/libpostal/` or `s3a://my-bucket/libpostal/`. When using a remote path, you must distribute the data to all executors before running queries by calling `sc.addFile("hdfs:///data/libpostal/", recursive=True)` (PySpark) or `sc.addFile("hdfs:///data/libpostal/", recursive = true)` (Scala). In this remote-URI mode, the automatic internet download performed by jpostal is disabled, so the remote directory must already contain the libpostal model files. For local filesystem paths, jpostal's download-if-needed behavior remains enabled.
```
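To make the note concrete, here is a minimal Scala sketch of the described workflow. The HDFS path is a placeholder, and the example assumes Sedona's SQL functions are already registered on the session.

```scala
import org.apache.spark.sql.SparkSession

// Placeholder path: hdfs:///data/libpostal/ must already contain the libpostal model files.
val remoteDataDir = "hdfs:///data/libpostal/"

val spark = SparkSession.builder()
  .appName("libpostal-remote-data")
  .config("spark.sedona.libpostal.dataDir", remoteDataDir)
  .getOrCreate()

// Distribute the model directory to every executor before running queries.
spark.sparkContext.addFile(remoteDataDir, recursive = true)

// Address expansion now resolves the data from the per-node SparkFiles copy.
spark.sql("SELECT ExpandAddress('781 Franklin Ave Crown Hts Brooklyn NY')").show(false)
```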
Why wouldn't this be automatically distributed as part of the implementation here?
`sc.addFile` needs to be called from the driver before tasks run on executors. It requires access to the SparkContext, which isn't readily available from within a UDF/expression.
Can we handle this when we initialize the Spark session with Sedona?
I don't want to handle that because not everybody needs libpostal, so it has to be optional.
But if it requires the user to opt in, then it is no different from an `sc.addFile` call?
Good point. Let me think about this.
Maybe a two-pronged approach would work:
- Add an optimizer rule that performs the `sc.addFile` call when it sees a libpostal method in the plan.
- In the execution itself, we also try to perform an `sc.addFile` call to account for the constant-folding case.

Not totally sure if I've missed something.
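For illustration, here is a rough sketch of what the suggested rule might look like. The rule name, the expression-name matching, and the remote-path check are all assumptions, and (as the next comment points out) this approach was not adopted in the PR.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical sketch only, not part of this PR: call sc.addFile automatically
// when a libpostal expression shows up in the logical plan.
object AddLibPostalDataFilesSketch extends Rule[LogicalPlan] {
  // Assumed expression class names for the libpostal functions.
  private val libpostalExprNames = Set("ExpandAddress", "ParseAddress")

  override def apply(plan: LogicalPlan): LogicalPlan = {
    val usesLibPostal = plan.find { node =>
      node.expressions.exists(expr =>
        expr.find(e => libpostalExprNames.contains(e.getClass.getSimpleName)).isDefined)
    }.isDefined

    if (usesLibPostal) {
      val dataDir = conf.getConfString("spark.sedona.libpostal.dataDir", "")
      // Naive remote-URI check; Spark ignores repeated addFile calls for the
      // same path, but doing this per query is still fragile.
      if (dataDir.contains("://") && !dataDir.startsWith("file:")) {
        SparkSession.active.sparkContext.addFile(dataDir, true)
      }
    }
    plan // the plan itself is left unchanged
  }
}
```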
Automated `sc.addFile` per query is dangerous, and we might end up having to handle race conditions. I think this overcomplicates the problem. I would rather not have this feature at all.
docs/api/sql/Function.md
Outdated
```diff
 !!!Note
     The version of jpostal installed with this package only supports Linux and MacOS. If you are using Windows, you will need to install libjpostal and libpostal manually and ensure that they are available in your `java.library.path`.
+    To prepare the libpostal data for a remote filesystem, first download it to a local machine by following the [libpostal installation instructions](https://github.com/openvenues/libpostal#installation-maclinux). After installation, the data files will be in the directory you specified during setup (commonly `/tmp/libpostal/`). Then upload them to your remote storage:
```
I'm not sure if "how to upload data to S3" is a necessary part of our docs.
fixed
```scala
val localPath = SparkFiles.get(basename)
val localFile = new File(localPath)
```
In contrast with the docs, it seems that this code downloads the data from HDFS itself.
No, it does not. It checks whether Spark has actually downloaded it. See: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.addFile.html#pyspark.SparkContext.addFile
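A minimal sketch of the check being discussed: resolve the basename through `SparkFiles` and fail fast if the data was never shipped with `sc.addFile`. The function name and error message are illustrative, not the exact code in this PR.

```scala
import java.io.File
import org.apache.spark.SparkFiles

// Sketch only: find where sc.addFile(..., recursive = true) placed the data on
// this node and verify it is actually there; nothing is downloaded here.
def resolveRemoteDataDir(remoteUri: String): String = {
  val basename = remoteUri.stripSuffix("/").split('/').last
  val localPath = SparkFiles.get(basename)
  val localDir = new File(localPath)
  if (!localDir.exists() || !localDir.isDirectory) {
    throw new IllegalStateException(
      s"libpostal data not found at $localPath. Call " +
        s"""sc.addFile("$remoteUri", recursive = true) on the driver first.""")
  }
  localPath
}
```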
Looks good other than the docs.
Did you read the Contributor Guide?
Is this PR related to a ticket?
Closes #2360 (Enhancements to libpostal integration: Fetch model from HDFS/object store).

What changes were proposed in this PR?
This PR enables libpostal model data (`spark.sedona.libpostal.dataDir`) to be loaded from remote filesystems such as HDFS, S3 (`s3a://`), GCS (`gs://`), and ABFS (`abfs://`), in addition to local paths.

Problem
jpostal requires the ~2 GB libpostal model data to reside on the local filesystem. In cloud deployments, users must manually pre-install the data on every executor node, which is operationally cumbersome.
Solution
Leverages Spark's native file distribution mechanism (`SparkContext.addFile` / `SparkFiles.get`) to distribute remote data to executors. Users call `sc.addFile(remotePath, recursive = true)` once before running queries, and Sedona resolves the data locally via `SparkFiles.get(basename)`.

- `LibPostalDataLoader`: New object that resolves remote paths via `SparkFiles.get()`. For local paths and `file://` URIs, values are passed through unchanged (with `file:` URIs normalized to plain paths). For remote URIs, the basename is extracted and looked up through SparkFiles. Uses case-insensitive scheme comparison for `file:` URIs (see the sketch after this list).
- `LibPostalUtils`: Modified to call `LibPostalDataLoader.resolveDataDir()` before jpostal init, and to disable jpostal's built-in download when using a remote path.
- If the data has not been distributed via `sc.addFile`, an `IllegalStateException` is thrown with instructions to call `sc.addFile(remotePath, recursive = true)`.
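A minimal sketch of the remote-path detection described above, under the assumption that anything with a multi-character, non-`file` URI scheme is treated as remote (the real `LibPostalDataLoader` may differ in details):

```scala
import java.net.URI

// Sketch of the assumed detection logic: no scheme => local/relative path,
// file: (any case) => local, single-letter scheme => Windows drive letter
// (e.g. C:/data), anything else (hdfs, s3a, gs, abfs, wasb, ...) => remote.
def isRemotePath(dataDir: String): Boolean = {
  if (dataDir == null || dataDir.trim.isEmpty) {
    false
  } else {
    val scheme =
      try new URI(dataDir).getScheme
      catch { case _: Exception => null } // e.g. Windows backslash paths fail URI parsing
    scheme != null && scheme.length > 1 && !scheme.equalsIgnoreCase("file")
  }
}
```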
Files changed

- `spark/common/.../expressions/LibPostalDataLoader.scala`
- `spark/common/.../expressions/LibPostalUtils.scala`: call `LibPostalDataLoader.resolveDataDir()` before jpostal init
- `docs/api/sql/Function.md`: `sc.addFile` usage for `ExpandAddress` and `ParseAddress`
- `spark/common/.../sql/LibPostalDataLoaderTest.scala`

How was this patch tested?
- `isRemotePath` — 11 URI schemes (local, relative, `file://`, `hdfs://` with/without host, `s3a://`, `gs://`, `abfs://`, `wasb://`, empty, Windows)
- `resolveDataDir` — local path unchanged, `file://` URI normalization
- `IllegalStateException` with clear `sc.addFile(..., recursive = true)` instructions when data not found
- `resolveDataDir` resolves a remote URI to the correct local path with all subdirectories intact
- Ran `mvn test -pl spark/common -Dspark=3.5 -Dscala=2.12 -Dtest=none -DfailIfNoTests=false -DwildcardSuites="org.apache.sedona.sql.LibPostalDataLoaderTest"`

Did this PR include necessary documentation updates?
Updated `docs/api/sql/Function.md` for both `ExpandAddress` and `ParseAddress` to document the default data directory (`<java.io.tmpdir>/libpostal/`), the `spark.sedona.libpostal.dataDir` configuration, data preparation instructions (libpostal install + upload to HDFS/S3), and instructions for using `sc.addFile` with remote paths.