
Fixes #22644: Add Iceberg table support for GCS and S3 Datalake connector#27735

Open
mohitjeswani01 wants to merge 3 commits into open-metadata:main from mohitjeswani01:feature/22644-iceberg-gcp-support

Conversation


@mohitjeswani01 mohitjeswani01 commented Apr 25, 2026

🧊 Iceberg Table Support for GCS and S3 Datalake Connector

Fixes #22644

Submitted as part of the WeMakeDevs × OpenMetadata Hackathon
(Main Project Track — OpenMetadata Connectors & Ingestion)
Implemented following maintainer direction from @harshach:
"Check existing connectors such as S3 connector how its supporting
iceberg and reuse existing parsing logic that's already there"


The Problem

Companies migrating from BigQuery to Iceberg on GCS could not
ingest Iceberg table metadata into OpenMetadata. The Datalake
connector already supported GCS and already had Iceberg JSON
parsing logic — but two precise bugs prevented them from
working together.

Bug 1 — Wrong Table Discovery
DatalakeGcsClient.get_table_names() listed every blob
individually. An Iceberg table orders/ produced 5+ entries:

orders/metadata/v1.metadata.json → treated as separate table
orders/metadata/v2.metadata.json → treated as separate table
orders/data/00000-0-abc.parquet → treated as separate table

Result: 5 bogus tables instead of 1 correct orders table.

Bug 2 — Iceberg Column Parser Never Called
In readers/dataframe/json.py, the raw_data gate only
opened for JSON files containing a "$schema" key.
Iceberg metadata JSON uses "format-version" — so
raw_data was always None, meaning
_is_iceberg_delta_metadata() and
_parse_iceberg_delta_schema() were never reached despite
being already correctly implemented.

Result: Iceberg tables showed garbage columns
(format-version, table-uuid, location)
instead of the actual data schema.


The Fix — 5 Production Files, Zero New Abstractions

All fixes follow existing patterns exactly and reuse
existing parsing logic unchanged.

Fix 1 — raw_data Gate (readers/dataframe/json.py)

# BEFORE — only JSON Schema files triggered Iceberg parsing:
raw_data = content if data.get("$schema") else None

# AFTER — Iceberg metadata files now also trigger it:
raw_data = (
    content if isinstance(data, dict) and (
        data.get("$schema") is not None           # JSON Schema (existing)
        or data.get("format-version") is not None # Apache Iceberg ← NEW
        or (                                       # Delta Lake structure ← NEW
            isinstance(data.get("schema"), dict)
            and isinstance(data.get("schema", {}).get("fields"), list)
        )
    ) else None
)

This single change unlocks the entire Iceberg parsing pipeline
that was already correctly implemented in datalake_utils.py.

Fix 2 — Iceberg Table Directory Detection (gcs.py + s3.py)

Added _get_iceberg_tables() to both DatalakeGcsClient
and DatalakeS3Client. Detects Iceberg table directories
by scanning for */metadata/v*.metadata.json pattern,
keeps only the latest version per table directory, and
yields one entry per Iceberg table instead of individual blobs.

Non-Iceberg buckets fall through to the original listing
behavior — zero breaking changes.
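The grouping logic can be sketched in isolation as follows. This is an illustrative stand-in, not the PR's actual code: `find_iceberg_tables` and the exact regex are assumptions based on the described `*/metadata/v*.metadata.json` pattern, and it uses integer version comparison as introduced in the follow-up commit. The real methods live on `DatalakeGcsClient` / `DatalakeS3Client`.

```python
import re
from typing import Dict, Iterable, Tuple

# Matches e.g. "warehouse/orders/metadata/v2.metadata.json"
ICEBERG_METADATA_RE = re.compile(
    r"^(?P<table>.+)/metadata/v(?P<version>\d+)\.metadata\.json$"
)


def find_iceberg_tables(blob_names: Iterable[str]) -> Dict[str, str]:
    """Map each Iceberg table directory to its latest metadata blob path."""
    latest: Dict[str, Tuple[int, str]] = {}  # table dir -> (version, blob path)
    for name in blob_names:
        match = ICEBERG_METADATA_RE.match(name)
        if not match:
            continue  # data files and non-Iceberg blobs are ignored here
        version = int(match.group("version"))  # integer compare: v10 > v9
        table = match.group("table")
        if table not in latest or version > latest[table][0]:
            latest[table] = (version, name)
    return {table: path for table, (_, path) in latest.items()}
```

Feeding it the blob listing from Bug 1 collapses the five bogus entries into a single `warehouse/orders` table pointing at `v2.metadata.json`.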

Fix 3 — Table Name + Type (datalake_utils.py + metadata.py)

Added get_iceberg_table_name_from_metadata_path() which
extracts the directory name from the metadata path:
"warehouse/orders/metadata/v2.metadata.json""orders"

standardize_table_name() now uses this for Iceberg paths.
get_tables_name_and_type() now yields TableType.Iceberg
for Iceberg metadata files.
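A minimal sketch of the name extraction described above. The function name follows the PR description, but the body is an assumption reconstructed from the stated behavior, not the merged implementation:

```python
import re
from typing import Optional


def get_iceberg_table_name_from_metadata_path(path: str) -> Optional[str]:
    """Return the table directory name for an Iceberg metadata path, else None."""
    match = re.match(
        r"^(?:.*/)?(?P<table>[^/]+)/metadata/v\d+\.metadata\.json$", path
    )
    return match.group("table") if match else None
```

Non-Iceberg paths return `None`, so `standardize_table_name()` can fall back to its original behavior for regular files.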

Fix 4 — Fetch Path Preservation (metadata.py)

Ensured fetch_dataframe_first_chunk() receives the original
metadata blob path for fetching while the Table entity displays
the clean directory name. Both values flow through the pipeline
correctly.
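Conceptually, two values travel together per Iceberg table: the clean display name and the raw blob path. The container below is purely illustrative (the actual connector passes these as separate tuple fields, not a named type):

```python
from typing import NamedTuple


class IcebergTableRef(NamedTuple):
    """Illustrative pairing of display name and fetch path."""
    entity_name: str  # clean directory name shown on the Table entity
    fetch_key: str    # original metadata blob path used for fetching


ref = IcebergTableRef(
    entity_name="orders",
    fetch_key="warehouse/orders/metadata/v2.metadata.json",
)
```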


Complete E2E Flow After Fix

For an Iceberg table warehouse/orders/ on GCS:

Step (Before → After):

  • get_table_names(): 5 blobs yielded → 1 entry (latest metadata path)
  • standardize_table_name(): warehouse/orders/metadata/v2.metadata.json → orders
  • TableType: Regular → Iceberg
  • raw_data gate: None (parser never called) → set (_is_iceberg_delta_metadata() called)
  • get_columns(): garbage keys from JSON → correct schema from schema.fields

What Was NOT Changed (Reused 100%)

Following @harshach's direction to reuse existing logic:

  • _is_iceberg_delta_metadata() — zero changes
  • _parse_iceberg_delta_schema() — zero changes
  • _parse_struct_fields() — zero changes
  • set_google_credentials() — zero changes
  • DatalakeGcsClient credential initialization — zero changes
  • No new connector directory created
  • No schema JSON changes
  • No frontend changes
  • No Java changes

Tests — 18 Tests, Zero Infrastructure Required


New tests added: 18 across 2 files

tests/unit/readers/test_json_reader.py — 4 new tests:

  • test_raw_data_set_for_iceberg_metadata — gate opens for Iceberg JSON
  • test_iceberg_columns_parsed_correctly — correct columns extracted
  • test_raw_data_none_for_regular_json — backward compatibility
  • test_raw_data_set_for_json_schema — existing behavior preserved
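The gate tests can be pictured with a standalone version of the Fix 1 predicate. `raw_data_gate` here is a simplified stand-in for the reader's logic, not the actual code under test:

```python
def raw_data_gate(data):
    """Simplified stand-in for the Fix 1 gate: returns the payload when it
    looks like JSON Schema, Iceberg metadata, or a Delta Lake schema."""
    if not isinstance(data, dict):
        return None
    is_json_schema = data.get("$schema") is not None
    is_iceberg = data.get("format-version") is not None
    is_delta = isinstance(data.get("schema"), dict) and isinstance(
        data.get("schema", {}).get("fields"), list
    )
    return data if (is_json_schema or is_iceberg or is_delta) else None


def test_raw_data_set_for_iceberg_metadata():
    iceberg = {"format-version": 2, "table-uuid": "abc", "schemas": []}
    assert raw_data_gate(iceberg) is not None


def test_raw_data_none_for_regular_json():
    # Ordinary data JSON must not trigger schema parsing (backward compatibility).
    assert raw_data_gate({"id": 1, "name": "x"}) is None
```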

tests/unit/source/database/test_iceberg_discovery.py — 14 new tests:

  • GCS: table detection, single table yield, multiple tables,
    fallback for non-Iceberg, mixed bucket handling
  • S3: same coverage for parity
  • Table name extraction: correct names, non-Iceberg returns None
  • TableType: Iceberg for metadata files, Regular for others

Files Changed

File (Change):

  • readers/dataframe/json.py: +12 lines (raw_data gate fix)
  • datalake/clients/gcs.py: +35 lines (Iceberg directory detection)
  • datalake/clients/s3.py: +38 lines (symmetric S3 fix)
  • utils/datalake/datalake_utils.py: +22 lines (table name helper)
  • datalake/metadata.py: +6 lines (TableType + fetch path)
  • tests/unit/readers/test_json_reader.py: +80 lines (4 tests)
  • tests/unit/source/database/test_iceberg_discovery.py: +305 lines (14 tests, new file)

Type of Change

  • Bug fix
  • New feature

Checklist

  • I have read the CONTRIBUTING document
  • My PR title is Fixes #22644: Add Iceberg table support for GCS and S3 Datalake connector
  • I have commented on my code, particularly in
    hard-to-understand areas
  • For JSON Schema changes: No schema changes were made.
    All fixes are within the Python ingestion layer only.
    Existing GCS and S3 credential schemas are reused unchanged.
  • I have added tests around the new logic (18 new tests)
  • The issue properly describes the goal and this PR
    implements it fully following maintainer guidance

Copilot AI review requested due to automatic review settings April 25, 2026 21:18
@mohitjeswani01 mohitjeswani01 requested a review from a team as a code owner April 25, 2026 21:18
@github-actions
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!


Comment thread ingestion/src/metadata/ingestion/source/database/datalake/clients/gcs.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/database/datalake/clients/s3.py Outdated
Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

…tadata#22644)

- Integer version comparison (v10 > v9) via regex group capture
- Single-pass listing: eliminates double bucket scan for non-Iceberg buckets
- Mixed buckets: regular files outside Iceberg dirs are now yielded
- Removes extra head_object/get_blob API calls (use listing size directly)
- Fix get_tables_name_and_type return type annotation to 5-tuple
- Update tests: remove _get_iceberg_tables direct calls, add v10 regression
@github-actions
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@gitar-bot

gitar-bot Bot commented Apr 25, 2026

Code Review ✅ Approved 4 resolved / 4 findings

Adds Iceberg table support for GCS and S3 Datalake connectors, resolving issues with version comparison, mixed table handling, redundant bucket listing, and incorrect type annotations.

✅ 4 resolved
Bug: Lexicographic version comparison fails for v10+ metadata

📄 ingestion/src/metadata/ingestion/source/database/datalake/clients/gcs.py:140 📄 ingestion/src/metadata/ingestion/source/database/datalake/clients/s3.py:90
The Iceberg "latest metadata" detection uses lexicographic string comparison (key_name > existing / blob.name > existing) to determine the highest version. This breaks for version numbers ≥ 10 because v9.metadata.json > v10.metadata.json lexicographically. The GCS docstring even acknowledges this ("v10 > v9 in padded form; for typical low version counts this is correct") but the assumption is fragile — production Iceberg tables commonly exceed 10 metadata versions as each schema evolution or snapshot creates a new version.

Extract the numeric version and compare as integers instead.
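The failure mode can be demonstrated in isolation. `metadata_version` is an illustrative helper, not the connector's actual code:

```python
import re

VERSION_RE = re.compile(r"/v(\d+)\.metadata\.json$")


def metadata_version(path: str) -> int:
    """Extract the numeric Iceberg metadata version, or -1 if absent."""
    match = VERSION_RE.search(path)
    return int(match.group(1)) if match else -1


paths = [
    "orders/metadata/v9.metadata.json",
    "orders/metadata/v10.metadata.json",
]
# Lexicographic comparison picks the wrong file: "v9..." > "v10..." as strings.
wrong = max(paths)
# Integer comparison picks v10 correctly.
right = max(paths, key=metadata_version)
```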

Bug: Mixed Iceberg + regular files: regular tables silently dropped

📄 ingestion/src/metadata/ingestion/source/database/datalake/clients/gcs.py:153-157 📄 ingestion/src/metadata/ingestion/source/database/datalake/clients/s3.py:101-107
When _get_iceberg_tables() returns any results, get_table_names() yields only Iceberg metadata entries and then returns, completely skipping all non-Iceberg files (CSV, Parquet, etc.) in the same bucket/prefix. This means a bucket containing both Iceberg tables and regular data files will lose visibility of all regular tables.

The test test_gcs_mixed_iceberg_and_regular_files documents this as intentional, but it's a data loss issue for users who legitimately have mixed content in a single bucket prefix. Instead of an early return, yield both Iceberg and non-Iceberg entries, filtering out blobs that belong to Iceberg table directories (metadata/, data/ subdirs).

Performance: Double bucket listing for non-Iceberg S3/GCS buckets

📄 ingestion/src/metadata/ingestion/source/database/datalake/clients/s3.py:100 📄 ingestion/src/metadata/ingestion/source/database/datalake/clients/s3.py:109 📄 ingestion/src/metadata/ingestion/source/database/datalake/clients/gcs.py:152 📄 ingestion/src/metadata/ingestion/source/database/datalake/clients/gcs.py:159
get_table_names() calls _get_iceberg_tables() which iterates all objects in the bucket via paginated API calls, then if no Iceberg tables are found, iterates all objects again in the regular listing loop. For large buckets with thousands of objects this doubles the number of LIST API calls (which are also billed per-request on S3/GCS).

Consider doing a single pass: iterate objects once, check each for the Iceberg metadata pattern, collect Iceberg tables and regular files in one loop, then yield appropriately.

Quality: Return type annotation still says 4-tuple, yields 5-tuple

📄 ingestion/src/metadata/ingestion/source/database/datalake/metadata.py:230 📄 ingestion/src/metadata/ingestion/source/database/datalake/metadata.py:278
The get_tables_name_and_type method's return type annotation at line 230 still declares Iterable[Tuple[str, TableType, SupportedTypes, Optional[int]]] but the method now yields a 5-tuple including key_name. This will cause type-checking tools (mypy, pyright) to flag a mismatch and confuses readers about the method's contract.


@mohitjeswani01
Author

Hi @harshach sir, could you please add a safe to test label? Thank you! 🙏

@harshach harshach added the safe to test Add this label to run secure Github workflows on PRs label Apr 26, 2026
@github-actions
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@mohitjeswani01
Author

Thanks @harshach sir, I will monitor the checks and work accordingly! 🙏

@sonarqubecloud

Quality Gate Failed Quality Gate failed for 'open-metadata-ingestion'

Failed conditions
E Security Review Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

@github-actions
Contributor

🟡 Playwright Results — all passed (15 flaky)

✅ 3958 passed · ❌ 0 failed · 🟡 15 flaky · ⏭️ 86 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 298 0 1 4
🟡 Shard 2 755 0 4 8
🟡 Shard 3 730 0 2 7
🟡 Shard 4 758 0 1 18
🟡 Shard 5 686 0 1 41
🟡 Shard 6 731 0 6 8
🟡 15 flaky test(s) (passed on retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event is created when description is updated (shard 2, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/DomainFilterQueryFilter.spec.ts › Subdomain assets should be visible when parent domain is selected (shard 2, 1 retry)
  • Features/DomainFilterQueryFilter.spec.ts › Domain filter should use exact match and prefix with dot to prevent false positives (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Worksheet (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
  • Pages/Glossary.spec.ts › Create glossary with all optional fields (tags, owners, reviewers, domain) (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › Column lineage for dashboardDataModel -> apiEndpoint (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/ODCSImportExport.spec.ts › Multi-object ODCS contract - object selector shows all schema objects (shard 6, 1 retry)
  • Pages/ServiceEntity.spec.ts › Announcement create, edit & delete (shard 6, 1 retry)
  • Pages/TasksUIFlow.spec.ts › Verify task lifecycle in activity feed (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

@mohitjeswani01
Author

Quality Gate Failed Quality Gate failed for 'open-metadata-ingestion'

Failed conditions E Security Review Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

Oops, the SonarQube quality gate failed; I will check it shortly!


Labels

safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add support for Iceberg tables in GCP

3 participants