Fixes #22644: Add Iceberg table support for GCS and S3 Datalake connector #27735
mohitjeswani01 wants to merge 3 commits into open-metadata:main from
Conversation
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help!
…tadata#22644)
- Integer version comparison (v10 > v9) via regex group capture
- Single-pass listing: eliminates double bucket scan for non-Iceberg buckets
- Mixed buckets: regular files outside Iceberg dirs are now yielded
- Removes extra head_object/get_blob API calls (use listing size directly)
- Fix get_tables_name_and_type return type annotation to 5-tuple
- Update tests: remove _get_iceberg_tables direct calls, add v10 regression
Code Review ✅ Approved (4 resolved / 4 findings)
Adds Iceberg table support for GCS and S3 Datalake connectors, resolving issues with version comparison, mixed table handling, redundant bucket listing, and incorrect type annotations.
✅ Bug: Lexicographic version comparison fails for v10+ metadata
✅ Bug: Mixed Iceberg + regular files: regular tables silently dropped
✅ Performance: Double bucket listing for non-Iceberg S3/GCS buckets
✅ Quality: Return type annotation still says 4-tuple, yields 5-tuple
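The v10+ finding above comes down to string ordering versus integer ordering. A minimal sketch of the difference (the helper name here is hypothetical, not the connector's actual code):

```python
import re

# Lexicographic comparison ranks "v9" above "v10" because '9' > '1';
# capturing the digits and comparing them as integers fixes the ordering.
VERSION_RE = re.compile(r"/metadata/v(\d+)\.metadata\.json$")

def pick_latest(paths):
    """Hypothetical helper: return the metadata file with the highest version."""
    return max(paths, key=lambda p: int(VERSION_RE.search(p).group(1)))

paths = [
    "warehouse/orders/metadata/v9.metadata.json",
    "warehouse/orders/metadata/v10.metadata.json",
]
# Plain max(paths) picks the v9 file (string order); pick_latest compares
# 10 > 9 and picks the v10 file.
```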
Hi @harshach sir, could you please add a
The Python checkstyle failed. Please run You can install the pre-commit hooks with
Thanks @harshach sir, I will monitor the checks and work accordingly! 🙏
🟡 Playwright Results — all passed (15 flaky)
✅ 3958 passed · ❌ 0 failed · 🟡 15 flaky · ⏭️ 86 skipped
🟡 15 flaky test(s) (passed on retry)
How to debug locally: download the playwright-test-results-<shard> artifact, unzip it, and run npx playwright show-trace path/to/trace.zip to view the trace.
Oops, the SonarQube check failed. I will look into it shortly!
🧊 Iceberg Table Support for GCS and S3 Datalake Connector
Fixes #22644
The Problem
Companies migrating from BigQuery to Iceberg on GCS could not
ingest Iceberg table metadata into OpenMetadata. The Datalake
connector already supported GCS and already had Iceberg JSON parsing logic, but two specific bugs prevented them from working together.
Bug 1 — Wrong Table Discovery
`DatalakeGcsClient.get_table_names()` listed every blob individually. An Iceberg table `orders/` produced 5+ entries:
- `orders/metadata/v1.metadata.json` → treated as a separate table
- `orders/metadata/v2.metadata.json` → treated as a separate table
- `orders/data/00000-0-abc.parquet` → treated as a separate table
Result: 5 bogus tables instead of 1 correct `orders` table.
Bug 2 — Iceberg Column Parser Never Called
In `readers/dataframe/json.py`, the `raw_data` gate only opened for JSON files containing a `"$schema"` key. Iceberg metadata JSON uses `"format-version"`, so `raw_data` was always `None`, meaning `_is_iceberg_delta_metadata()` and `_parse_iceberg_delta_schema()` were never reached despite being already correctly implemented.
Result: Iceberg tables showed garbage columns (`format-version`, `table-uuid`, `location`) instead of the actual data schema.
The Fix — 5 Production Files, Zero New Abstractions
All fixes follow existing patterns exactly and reuse
existing parsing logic unchanged.
Fix 1 — `raw_data` Gate (`readers/dataframe/json.py`)
This single change unlocks the entire Iceberg parsing pipeline that was already correctly implemented in `datalake_utils.py`.
Fix 2 — Iceberg Table Directory Detection (`gcs.py` + `s3.py`)
Added `_get_iceberg_tables()` to both `DatalakeGcsClient` and `DatalakeS3Client`. It detects Iceberg table directories by scanning for the `*/metadata/v*.metadata.json` pattern, keeps only the latest version per table directory, and yields one entry per Iceberg table instead of individual blobs. Non-Iceberg buckets fall through to the original listing behavior — zero breaking changes.
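A condensed sketch of the discovery idea: one pass over the bucket listing, grouping metadata files by table directory and letting keys outside any Iceberg directory pass through untouched (function and variable names are illustrative, not the client's exact API):

```python
import re

# Pattern: <table dir>/metadata/v<N>.metadata.json
METADATA_RE = re.compile(r"^(?P<table>.+)/metadata/v(?P<version>\d+)\.metadata\.json$")

def discover(keys):
    """Group metadata files by table directory (keeping the highest
    version) and pass every key outside an Iceberg directory through."""
    latest = {}  # table dir -> (version, metadata key)
    for key in keys:
        match = METADATA_RE.match(key)
        if match:
            version = int(match.group("version"))
            table = match.group("table")
            if version > latest.get(table, (-1,))[0]:
                latest[table] = (version, key)
    iceberg_dirs = set(latest)
    regular = [
        k for k in keys
        if not any(k.startswith(d + "/") for d in iceberg_dirs)
    ]
    return {t: path for t, (_, path) in latest.items()}, regular

keys = [
    "orders/metadata/v1.metadata.json",
    "orders/metadata/v2.metadata.json",
    "orders/data/00000-0-abc.parquet",
    "reports/2024.csv",
]
iceberg, regular = discover(keys)
# iceberg maps "orders" to its v2 metadata file; "reports/2024.csv"
# survives as a regular (non-Iceberg) key.
```

Because the grouping happens in memory over a single listing, a mixed bucket needs only one scan, which is the double-listing fix flagged in the review.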
Fix 3 — Table Name + Type (`datalake_utils.py` + `metadata.py`)
Added `get_iceberg_table_name_from_metadata_path()`, which extracts the directory name from the metadata path: `"warehouse/orders/metadata/v2.metadata.json"` → `"orders"`. `standardize_table_name()` now uses this for Iceberg paths, and `get_tables_name_and_type()` now yields `TableType.Iceberg` for Iceberg metadata files.
Fix 4 — Fetch Path Preservation (`metadata.py`)
Ensured `fetch_dataframe_first_chunk()` receives the original metadata blob path for fetching, while the Table entity displays the clean directory name. Both values flow through the pipeline correctly.
Complete E2E Flow After Fix
For an Iceberg table `warehouse/orders/` on GCS:
- `get_table_names()` yields one entry for the table directory
- `standardize_table_name()`: `warehouse/orders/metadata/v2.metadata.json` → `orders`
- `TableType`: `Regular` → `Iceberg`
- `raw_data` gate: was `None` (parser never called) → `_is_iceberg_delta_metadata()` is now called
- `get_columns()`: columns come from `schema.fields`
What Was NOT Changed (Reused 100%)
Following @harshach's direction to reuse existing logic:
- `_is_iceberg_delta_metadata()` — zero changes
- `_parse_iceberg_delta_schema()` — zero changes
- `_parse_struct_fields()` — zero changes
- `set_google_credentials()` — zero changes
- `DatalakeGcsClient` credential initialization — zero changes
Tests — 18 Tests, Zero Infrastructure Required
New tests added: 18 across 2 files
`tests/unit/readers/test_json_reader.py` — 4 new tests:
- `test_raw_data_set_for_iceberg_metadata` — gate opens for Iceberg JSON
- `test_iceberg_columns_parsed_correctly` — correct columns extracted
- `test_raw_data_none_for_regular_json` — backward compatibility
- `test_raw_data_set_for_json_schema` — existing behavior preserved
`tests/unit/source/database/test_iceberg_discovery.py` — 14 new tests: fallback for non-Iceberg, mixed bucket handling
Files Changed
- `readers/dataframe/json.py`
- `datalake/clients/gcs.py`
- `datalake/clients/s3.py`
- `utils/datalake/datalake_utils.py`
- `datalake/metadata.py`
- `tests/unit/readers/test_json_reader.py`
- `tests/unit/source/database/test_iceberg_discovery.py`
Type of Change
Checklist
- Fixes #22644: Add Iceberg table support for GCS and S3 Datalake connector
- All fixes are within the Python ingestion layer only.
- Existing GCS and S3 credential schemas are reused unchanged.
- Implements the feature fully, following maintainer guidance.