
Fixes #22644: Add Iceberg table support for GCS and S3 Datalake connector#27735

Open
mohitjeswani01 wants to merge 3 commits into open-metadata:main from mohitjeswani01:feature/22644-iceberg-gcp-support

Conversation


@mohitjeswani01 mohitjeswani01 commented Apr 25, 2026

🧊 Iceberg Table Support for GCS and S3 Datalake Connector

Fixes #22644

Submitted as part of the WeMakeDevs × OpenMetadata Hackathon
(Main Project Track — OpenMetadata Connectors & Ingestion)
Implemented following maintainer direction from @harshach:
"Check existing connectors such as S3 connector how its supporting
iceberg and reuse existing parsing logic that's already there"


The Problem

Companies migrating from BigQuery to Iceberg on GCS could not
ingest Iceberg table metadata into OpenMetadata. The Datalake
connector already supported GCS and already had Iceberg JSON
parsing logic — but two precise bugs prevented them from
working together.

Bug 1 — Wrong Table Discovery
DatalakeGcsClient.get_table_names() listed every blob
individually. An Iceberg table orders/ produced 5+ entries:

orders/metadata/v1.metadata.json → treated as separate table
orders/metadata/v2.metadata.json → treated as separate table
orders/data/00000-0-abc.parquet → treated as separate table

Result: 5 bogus tables instead of 1 correct orders table.

Bug 2 — Iceberg Column Parser Never Called
In readers/dataframe/json.py, the raw_data gate only
opened for JSON files containing a "$schema" key.
Iceberg metadata JSON uses "format-version" — so
raw_data was always None, meaning
_is_iceberg_delta_metadata() and
_parse_iceberg_delta_schema() were never reached despite
being already correctly implemented.

Result: Iceberg tables showed garbage columns
(format-version, table-uuid, location)
instead of the actual data schema.


The Fix — 5 Production Files, Zero New Abstractions

All fixes follow existing patterns exactly and reuse
existing parsing logic unchanged.

Fix 1 — raw_data Gate (readers/dataframe/json.py)

# BEFORE — only JSON Schema files triggered Iceberg parsing:
raw_data = content if data.get("$schema") else None

# AFTER — Iceberg metadata files now also trigger it:
raw_data = (
    content if isinstance(data, dict) and (
        data.get("$schema") is not None           # JSON Schema (existing)
        or data.get("format-version") is not None # Apache Iceberg ← NEW
        or (                                       # Delta Lake structure ← NEW
            isinstance(data.get("schema"), dict)
            and isinstance(data.get("schema", {}).get("fields"), list)
        )
    ) else None
)

This single change unlocks the entire Iceberg parsing pipeline
that was already correctly implemented in datalake_utils.py.

Fix 2 — Iceberg Table Directory Detection (gcs.py + s3.py)

Added _get_iceberg_tables() to both DatalakeGcsClient
and DatalakeS3Client. Detects Iceberg table directories
by scanning for */metadata/v*.metadata.json pattern,
keeps only the latest version per table directory, and
yields one entry per Iceberg table instead of individual blobs.

Non-Iceberg buckets fall through to the original listing
behavior — zero breaking changes.
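The grouping logic can be sketched in isolation as follows. This is an illustrative stand-in, not the PR's actual code: `find_iceberg_tables` and the exact regex are assumptions based on the described `*/metadata/v*.metadata.json` pattern, and it uses integer version comparison as introduced in the follow-up commit. The real methods live on `DatalakeGcsClient` / `DatalakeS3Client`.

```python
import re
from typing import Dict, Iterable, Tuple

# Matches e.g. "warehouse/orders/metadata/v2.metadata.json"
ICEBERG_METADATA_RE = re.compile(
    r"^(?P<table>.+)/metadata/v(?P<version>\d+)\.metadata\.json$"
)


def find_iceberg_tables(blob_names: Iterable[str]) -> Dict[str, str]:
    """Map each Iceberg table directory to its latest metadata blob path."""
    latest: Dict[str, Tuple[int, str]] = {}  # table dir -> (version, blob path)
    for name in blob_names:
        match = ICEBERG_METADATA_RE.match(name)
        if not match:
            continue  # data files and non-Iceberg blobs are ignored here
        version = int(match.group("version"))  # integer compare: v10 > v9
        table = match.group("table")
        if table not in latest or version > latest[table][0]:
            latest[table] = (version, name)
    return {table: path for table, (_, path) in latest.items()}
```

Feeding it the blob listing from Bug 1 collapses the five bogus entries into a single `warehouse/orders` table pointing at `v2.metadata.json`.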

Fix 3 — Table Name + Type (datalake_utils.py + metadata.py)

Added get_iceberg_table_name_from_metadata_path() which
extracts the directory name from the metadata path:
"warehouse/orders/metadata/v2.metadata.json""orders"

standardize_table_name() now uses this for Iceberg paths.
get_tables_name_and_type() now yields TableType.Iceberg
for Iceberg metadata files.
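A minimal sketch of the name extraction described above. The function name follows the PR description, but the body is an assumption reconstructed from the stated behavior, not the merged implementation:

```python
import re
from typing import Optional


def get_iceberg_table_name_from_metadata_path(path: str) -> Optional[str]:
    """Return the table directory name for an Iceberg metadata path, else None."""
    match = re.match(
        r"^(?:.*/)?(?P<table>[^/]+)/metadata/v\d+\.metadata\.json$", path
    )
    return match.group("table") if match else None
```

Non-Iceberg paths return `None`, so `standardize_table_name()` can fall back to its original behavior for regular files.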

Fix 4 — Fetch Path Preservation (metadata.py)

Ensured fetch_dataframe_first_chunk() receives the original
metadata blob path for fetching while the Table entity displays
the clean directory name. Both values flow through the pipeline
correctly.
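Conceptually, two values travel together per Iceberg table: the clean display name and the raw blob path. The container below is purely illustrative (the actual connector passes these as separate tuple fields, not a named type):

```python
from typing import NamedTuple


class IcebergTableRef(NamedTuple):
    """Illustrative pairing of display name and fetch path."""
    entity_name: str  # clean directory name shown on the Table entity
    fetch_key: str    # original metadata blob path used for fetching


ref = IcebergTableRef(
    entity_name="orders",
    fetch_key="warehouse/orders/metadata/v2.metadata.json",
)
```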


Complete E2E Flow After Fix

For an Iceberg table warehouse/orders/ on GCS:

Step (Before → After):

  • get_table_names(): 5 blobs yielded → 1 entry (latest metadata path)
  • standardize_table_name(): warehouse/orders/metadata/v2.metadata.json → orders
  • TableType: Regular → Iceberg
  • raw_data gate: None (parser never called) → set (_is_iceberg_delta_metadata() called)
  • get_columns(): garbage keys from JSON → correct schema from schema.fields

What Was NOT Changed (Reused 100%)

Following @harshach's direction to reuse existing logic:

  • _is_iceberg_delta_metadata() — zero changes
  • _parse_iceberg_delta_schema() — zero changes
  • _parse_struct_fields() — zero changes
  • set_google_credentials() — zero changes
  • DatalakeGcsClient credential initialization — zero changes
  • No new connector directory created
  • No schema JSON changes
  • No frontend changes
  • No Java changes

Tests — 18 Tests, Zero Infrastructure Required


New tests added: 18 across 2 files

tests/unit/readers/test_json_reader.py — 4 new tests:

  • test_raw_data_set_for_iceberg_metadata — gate opens for Iceberg JSON
  • test_iceberg_columns_parsed_correctly — correct columns extracted
  • test_raw_data_none_for_regular_json — backward compatibility
  • test_raw_data_set_for_json_schema — existing behavior preserved
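The gate tests can be pictured with a standalone version of the Fix 1 predicate. `raw_data_gate` here is a simplified stand-in for the reader's logic, not the actual code under test:

```python
def raw_data_gate(data):
    """Simplified stand-in for the Fix 1 gate: returns the payload when it
    looks like JSON Schema, Iceberg metadata, or a Delta Lake schema."""
    if not isinstance(data, dict):
        return None
    is_json_schema = data.get("$schema") is not None
    is_iceberg = data.get("format-version") is not None
    is_delta = isinstance(data.get("schema"), dict) and isinstance(
        data.get("schema", {}).get("fields"), list
    )
    return data if (is_json_schema or is_iceberg or is_delta) else None


def test_raw_data_set_for_iceberg_metadata():
    iceberg = {"format-version": 2, "table-uuid": "abc", "schemas": []}
    assert raw_data_gate(iceberg) is not None


def test_raw_data_none_for_regular_json():
    # Ordinary data JSON must not trigger schema parsing (backward compatibility).
    assert raw_data_gate({"id": 1, "name": "x"}) is None
```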

tests/unit/source/database/test_iceberg_discovery.py — 14 new tests:

  • GCS: table detection, single table yield, multiple tables,
    fallback for non-Iceberg, mixed bucket handling
  • S3: same coverage for parity
  • Table name extraction: correct names, non-Iceberg returns None
  • TableType: Iceberg for metadata files, Regular for others

Files Changed

File (Change):

  • readers/dataframe/json.py: +12 lines (raw_data gate fix)
  • datalake/clients/gcs.py: +35 lines (Iceberg directory detection)
  • datalake/clients/s3.py: +38 lines (symmetric S3 fix)
  • utils/datalake/datalake_utils.py: +22 lines (table name helper)
  • datalake/metadata.py: +6 lines (TableType + fetch path)
  • tests/unit/readers/test_json_reader.py: +80 lines (4 tests)
  • tests/unit/source/database/test_iceberg_discovery.py: +305 lines (14 tests, new file)

Type of Change

  • Bug fix
  • New feature

Checklist

  • I have read the CONTRIBUTING document
  • My PR title is Fixes #22644: Add Iceberg table support for GCS and S3 Datalake connector
  • I have commented on my code, particularly in
    hard-to-understand areas
  • For JSON Schema changes: No schema changes were made.
    All fixes are within the Python ingestion layer only.
    Existing GCS and S3 credential schemas are reused unchanged.
  • I have added tests around the new logic (18 new tests)
  • The issue properly describes the goal and this PR
    implements it fully following maintainer guidance

Copilot AI review requested due to automatic review settings April 25, 2026 21:18
@mohitjeswani01 mohitjeswani01 requested a review from a team as a code owner April 25, 2026 21:18
@github-actions
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!


Comment thread ingestion/src/metadata/ingestion/source/database/datalake/clients/gcs.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/database/datalake/clients/s3.py Outdated
Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

…tadata#22644)

- Integer version comparison (v10 > v9) via regex group capture
- Single-pass listing: eliminates double bucket scan for non-Iceberg buckets
- Mixed buckets: regular files outside Iceberg dirs are now yielded
- Removes extra head_object/get_blob API calls (use listing size directly)
- Fix get_tables_name_and_type return type annotation to 5-tuple
- Update tests: remove _get_iceberg_tables direct calls, add v10 regression
@github-actions
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@gitar-bot

gitar-bot Bot commented Apr 25, 2026

Code Review ✅ Approved 4 resolved / 4 findings

Adds Iceberg table support for GCS and S3 Datalake connectors, resolving issues with version comparison, mixed table handling, redundant bucket listing, and incorrect type annotations.

✅ 4 resolved
Bug: Lexicographic version comparison fails for v10+ metadata

📄 ingestion/src/metadata/ingestion/source/database/datalake/clients/gcs.py:140 📄 ingestion/src/metadata/ingestion/source/database/datalake/clients/s3.py:90
The Iceberg "latest metadata" detection uses lexicographic string comparison (key_name > existing / blob.name > existing) to determine the highest version. This breaks for version numbers ≥ 10 because v9.metadata.json > v10.metadata.json lexicographically. The GCS docstring even acknowledges this ("v10 > v9 in padded form; for typical low version counts this is correct") but the assumption is fragile — production Iceberg tables commonly exceed 10 metadata versions as each schema evolution or snapshot creates a new version.

Extract the numeric version and compare as integers instead.
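The failure mode can be demonstrated in isolation. `metadata_version` is an illustrative helper, not the connector's actual code:

```python
import re

VERSION_RE = re.compile(r"/v(\d+)\.metadata\.json$")


def metadata_version(path: str) -> int:
    """Extract the numeric Iceberg metadata version, or -1 if absent."""
    match = VERSION_RE.search(path)
    return int(match.group(1)) if match else -1


paths = [
    "orders/metadata/v9.metadata.json",
    "orders/metadata/v10.metadata.json",
]
# Lexicographic comparison picks the wrong file: "v9..." > "v10..." as strings.
wrong = max(paths)
# Integer comparison picks v10 correctly.
right = max(paths, key=metadata_version)
```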

Bug: Mixed Iceberg + regular files: regular tables silently dropped

📄 ingestion/src/metadata/ingestion/source/database/datalake/clients/gcs.py:153-157 📄 ingestion/src/metadata/ingestion/source/database/datalake/clients/s3.py:101-107
When _get_iceberg_tables() returns any results, get_table_names() yields only Iceberg metadata entries and then returns, completely skipping all non-Iceberg files (CSV, Parquet, etc.) in the same bucket/prefix. This means a bucket containing both Iceberg tables and regular data files will lose visibility of all regular tables.

The test test_gcs_mixed_iceberg_and_regular_files documents this as intentional, but it's a data loss issue for users who legitimately have mixed content in a single bucket prefix. Instead of an early return, yield both Iceberg and non-Iceberg entries, filtering out blobs that belong to Iceberg table directories (metadata/, data/ subdirs).

Performance: Double bucket listing for non-Iceberg S3/GCS buckets

📄 ingestion/src/metadata/ingestion/source/database/datalake/clients/s3.py:100 📄 ingestion/src/metadata/ingestion/source/database/datalake/clients/s3.py:109 📄 ingestion/src/metadata/ingestion/source/database/datalake/clients/gcs.py:152 📄 ingestion/src/metadata/ingestion/source/database/datalake/clients/gcs.py:159
get_table_names() calls _get_iceberg_tables() which iterates all objects in the bucket via paginated API calls, then if no Iceberg tables are found, iterates all objects again in the regular listing loop. For large buckets with thousands of objects this doubles the number of LIST API calls (which are also billed per-request on S3/GCS).

Consider doing a single pass: iterate objects once, check each for the Iceberg metadata pattern, collect Iceberg tables and regular files in one loop, then yield appropriately.

Quality: Return type annotation still says 4-tuple, yields 5-tuple

📄 ingestion/src/metadata/ingestion/source/database/datalake/metadata.py:230 📄 ingestion/src/metadata/ingestion/source/database/datalake/metadata.py:278
The get_tables_name_and_type method's return type annotation at line 230 still declares Iterable[Tuple[str, TableType, SupportedTypes, Optional[int]]] but the method now yields a 5-tuple including key_name. This will cause type-checking tools (mypy, pyright) to flag a mismatch and confuses readers about the method's contract.


@mohitjeswani01
Author

Hi @harshach sir, could you please add a safe to test label? Thank you! 🙏

@harshach harshach added the safe to test Add this label to run secure Github workflows on PRs label Apr 26, 2026
@github-actions
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@mohitjeswani01
Author

Thanks @harshach sir, I will monitor the checks and work accordingly! 🙏

@sonarqubecloud

Quality Gate Failed Quality Gate failed for 'open-metadata-ingestion'

Failed conditions
E Security Review Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

@github-actions
Contributor

🟡 Playwright Results — all passed (15 flaky)

✅ 3958 passed · ❌ 0 failed · 🟡 15 flaky · ⏭️ 86 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 298 0 1 4
🟡 Shard 2 755 0 4 8
🟡 Shard 3 730 0 2 7
🟡 Shard 4 758 0 1 18
🟡 Shard 5 686 0 1 41
🟡 Shard 6 731 0 6 8
🟡 15 flaky test(s) (passed on retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event is created when description is updated (shard 2, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/DomainFilterQueryFilter.spec.ts › Subdomain assets should be visible when parent domain is selected (shard 2, 1 retry)
  • Features/DomainFilterQueryFilter.spec.ts › Domain filter should use exact match and prefix with dot to prevent false positives (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Worksheet (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
  • Pages/Glossary.spec.ts › Create glossary with all optional fields (tags, owners, reviewers, domain) (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › Column lineage for dashboardDataModel -> apiEndpoint (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/ODCSImportExport.spec.ts › Multi-object ODCS contract - object selector shows all schema objects (shard 6, 1 retry)
  • Pages/ServiceEntity.spec.ts › Announcement create, edit & delete (shard 6, 1 retry)
  • Pages/TasksUIFlow.spec.ts › Verify task lifecycle in activity feed (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

@mohitjeswani01
Author

Quality Gate Failed Quality Gate failed for 'open-metadata-ingestion'

Failed conditions E Security Review Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

Oops, the SonarQube quality gate failed; I will check it shortly!


Labels

safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add support for Iceberg tables in GCP

3 participants