
feat(hive): add is_partition_column flag and get_table_partition_details#27731

Open
Reet24-del wants to merge 2 commits into open-metadata:main from Reet24-del:feat/hive-partition-key-flag

Conversation

@Reet24-del

Summary

Fixes #26712

The Hive connector mixed partition columns (folder-path segments such as year and country) with regular data columns in the ingested schema, so users could not tell partition keys from data columns. That made it hard to write efficient queries and to understand how a table is organized. Two earlier PRs (#27029 and #27278) addressed this but have not landed; this PR is a clean reimplementation.

Changes

ingestion/src/metadata/ingestion/source/database/hive/utils.py

  • Added _get_partition_column_names(rows): a strict state-machine parser that reads the # Partition Information section of DESCRIBE FORMATTED output and returns the set of partition column names. Exits cleanly on the next #-prefixed section header (e.g. # Detailed Table Information) to avoid false positives from Owner:/Location: rows.
  • get_columns now calls _get_partition_column_names before the main loop and attaches is_partition_column=True to every column that appears in the partition section. Non-partitioned tables get is_partition_column=False on all columns — no regression to existing behaviour.
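
For reviewers who want a picture of the parsing approach described above, here is a minimal sketch. It is not the code in this PR. It assumes DESCRIBE FORMATTED rows arrive as (col_name, data_type, comment) tuples, and the names parse_partition_column_names and flag_partition_columns are hypothetical stand-ins for _get_partition_column_names and the flagging step in get_columns.

```python
from typing import Dict, Iterable, List, Optional, Sequence, Set

# Assumed row shape: (col_name, data_type, comment), as a DB-API cursor
# might return for DESCRIBE FORMATTED. The real row shape may differ.
Row = Sequence[Optional[str]]

PARTITION_HEADER = "# Partition Information"
COLUMN_HEADER = "# col_name"


def parse_partition_column_names(rows: Iterable[Row]) -> Set[str]:
    """Collect column names listed under '# Partition Information'."""
    names: Set[str] = set()
    in_partition_section = False
    for row in rows:
        first = (row[0] if row else None) or ""
        first = first.strip()
        if not in_partition_section:
            # State 1: skip everything until the partition header appears.
            if first == PARTITION_HEADER:
                in_partition_section = True
            continue
        # State 2: inside the partition section.
        if not first:
            continue  # blank separator row
        if first == COLUMN_HEADER:
            continue  # repeated column-header row
        if first.startswith("#"):
            break  # next section, e.g. '# Detailed Table Information'
        names.add(first)
    return names


def flag_partition_columns(
    column_names: List[str], partition_names: Set[str]
) -> List[Dict[str, object]]:
    """Attach is_partition_column to each column (hypothetical helper)."""
    return [
        {"name": name, "is_partition_column": name in partition_names}
        for name in column_names
    ]


sample_rows = [
    ("# col_name", "data_type", "comment"),
    ("id", "int", ""),
    ("", None, None),
    ("# Partition Information", None, None),
    ("# col_name", "data_type", "comment"),
    ("year", "int", ""),
    ("country", "string", ""),
    ("", None, None),
    ("# Detailed Table Information", None, None),
    ("Owner:", "hive", None),
]
partition_names = parse_partition_column_names(sample_rows)
flagged = flag_partition_columns(["id", "year", "country"], partition_names)
```

Breaking on the first `#`-prefixed row after the partition section is what keeps rows such as Owner: out of the result. A table with no partition header never leaves the first state, so it yields an empty set and every column is flagged False.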

ingestion/src/metadata/ingestion/source/database/hive/metadata.py

  • Added get_table_partition_details(table_name, schema_name, inspector) to HiveSource. It reads the is_partition_column flag produced by get_columns and builds a TablePartition with PartitionColumnDetails for each partition key, using PartitionIntervalTypes.COLUMN_VALUE (Hive partitions are value-based folder segments, not time/integer intervals).
  • Returns (False, None) for non-partitioned tables and on any unexpected error, so ingestion does not break for users without partitioned tables.
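
The flow described above can be sketched as follows. This is an illustration under stated assumptions, not the method added to HiveSource. The dataclasses and the enum are local stand-ins for TablePartition, PartitionColumnDetails, and PartitionIntervalTypes so that the block runs without OpenMetadata installed, and their field names are assumptions. partition_details_from_columns is a hypothetical name that takes the column dicts directly instead of an inspector.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple


class IntervalType(Enum):
    """Stand-in for PartitionIntervalTypes; only the member used here."""

    COLUMN_VALUE = "COLUMN_VALUE"


@dataclass
class ColumnDetails:
    """Stand-in for PartitionColumnDetails (field names are assumptions)."""

    column_name: str
    interval_type: IntervalType


@dataclass
class Partition:
    """Stand-in for TablePartition (field name is an assumption)."""

    columns: List[ColumnDetails]


def partition_details_from_columns(
    columns: List[Dict[str, object]],
) -> Tuple[bool, Optional[Partition]]:
    """Build partition details from columns carrying is_partition_column."""
    try:
        keys = [
            str(col["name"]) for col in columns if col.get("is_partition_column")
        ]
        if not keys:
            # Non-partitioned table: report "no partition" instead of failing.
            return False, None
        details = [ColumnDetails(key, IntervalType.COLUMN_VALUE) for key in keys]
        return True, Partition(columns=details)
    except Exception:
        # Mirror the described behaviour: an unexpected error is swallowed
        # and treated as "no partition details available".
        return False, None


is_partitioned, partition = partition_details_from_columns(
    [
        {"name": "id", "is_partition_column": False},
        {"name": "year", "is_partition_column": True},
    ]
)
```

Returning a (bool, optional details) pair lets the caller skip partition handling with a single check. A real implementation would probably log the swallowed exception so that parsing problems remain visible.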

ingestion/tests/unit/topology/database/test_hive.py

Added TestHivePartitionKeyFlag with 8 unit tests covering:

  • _get_partition_column_names: no-partition table, single key, multiple keys, correct exit on the next section header.
  • get_columns: single partition key flagged, non-partitioned table, multiple partition keys — all with correct is_partition_column values.

How to test

cd ingestion
python -m pytest tests/unit/topology/database/test_hive.py::TestHivePartitionKeyFlag -v

All 8 new tests pass alongside the existing Hive test suite.

Resolves open-metadata#26712

**Problem:**
The Hive connector mixed partition columns (folder-path segments like year,
country) with regular data columns in the ingested schema. There was no way
for users to distinguish which columns are partition keys, making it hard to
write efficient queries or understand table organization.

**Changes:**

ingestion/src/metadata/ingestion/source/database/hive/utils.py
- Add _get_partition_column_names(rows): a strict state-machine parser that
  reads the '# Partition Information' section of DESCRIBE FORMATTED output
  and returns a set of partition column names. Exits cleanly on the next
  '#'-prefixed section header to avoid false positives from Owner:/Location:
  rows.
- get_columns now calls _get_partition_column_names before the main loop and
  attaches is_partition_column=True to every column that appears in the
  partition section. Non-partitioned tables get is_partition_column=False on
  all columns (no regression).

ingestion/src/metadata/ingestion/source/database/hive/metadata.py
- Add get_table_partition_details(table_name, schema_name, inspector) to
  HiveSource. It reads the is_partition_column flag produced by get_columns
  and builds a TablePartition with PartitionColumnDetails for each partition
  key, using PartitionIntervalTypes.COLUMN_VALUE (Hive partitions are
  value-based folder segments, not time/integer intervals).
- Returns (False, None) gracefully for non-partitioned tables and on any
  unexpected error.

ingestion/tests/unit/topology/database/test_hive.py
- Add TestHivePartitionKeyFlag with 8 unit tests covering:
  - _get_partition_column_names: no-partition table, single key, multiple
    keys, and correct exit on the next section header.
  - get_columns: single partition key flagged, non-partitioned table,
    multiple partition keys, all with correct is_partition_column values.
@github-actions
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!


# Sentinel header names that appear in DESCRIBE FORMATTED output; these rows
# are metadata rows, not real columns.
_DESCRIBE_SECTION_HEADERS = {"# Partition Information", "# col_name"}

💡 Quality: Unused module-level constant _DESCRIBE_SECTION_HEADERS

The constant _DESCRIBE_SECTION_HEADERS defined at line 27 is never referenced anywhere in the codebase. It appears to be a leftover from an earlier design where section header detection was centralised. Since _get_partition_column_names and get_columns both use inline string literals instead, this constant is dead code.

Suggested fix:

Remove the unused constant:

-_DESCRIBE_SECTION_HEADERS = {"# Partition Information", "# col_name"}


Member


@Reet24-del can you check this comment?

@gitar-bot

gitar-bot Bot commented Apr 25, 2026

Code Review 👍 Approved with suggestions 0 resolved / 1 findings

Implements is_partition_column and get_table_partition_details to enhance Hive metadata handling. Remove the unused module-level constant _DESCRIBE_SECTION_HEADERS to clean up the implementation.

💡 Quality: Unused module-level constant _DESCRIBE_SECTION_HEADERS

📄 ingestion/src/metadata/ingestion/source/database/hive/utils.py:27

The constant _DESCRIBE_SECTION_HEADERS defined at line 27 is never referenced anywhere in the codebase. It appears to be a leftover from an earlier design where section header detection was centralised. Since _get_partition_column_names and get_columns both use inline string literals instead, this constant is dead code.

Suggested fix
Remove the unused constant:

-_DESCRIBE_SECTION_HEADERS = {"# Partition Information", "# col_name"}



Development

Successfully merging this pull request may close these issues.

feat: Add metadata flag to identify Hive partition keys

2 participants