feat(hive): add is_partition_column flag and get_table_partition_details#27731
Reet24-del wants to merge 2 commits into open-metadata:main
Resolves open-metadata#26712

**Problem:** The Hive connector mixed partition columns (folder-path segments like year, country) with regular data columns in the ingested schema. There was no way for users to distinguish which columns are partition keys, making it hard to write efficient queries or understand table organization.

**Changes:**

ingestion/src/metadata/ingestion/source/database/hive/utils.py
- Add _get_partition_column_names(rows): a strict state-machine parser that reads the '# Partition Information' section of DESCRIBE FORMATTED output and returns a set of partition column names. Exits cleanly on the next '#'-prefixed section header to avoid false positives from Owner:/Location: rows.
- get_columns now calls _get_partition_column_names before the main loop and attaches is_partition_column=True to every column that appears in the partition section. Non-partitioned tables get is_partition_column=False on all columns (no regression).

ingestion/src/metadata/ingestion/source/database/hive/metadata.py
- Add get_table_partition_details(table_name, schema_name, inspector) to HiveSource. It reads the is_partition_column flag produced by get_columns and builds a TablePartition with PartitionColumnDetails for each partition key, using PartitionIntervalTypes.COLUMN_VALUE (Hive partitions are value-based folder segments, not time/integer intervals).
- Returns (False, None) gracefully for non-partitioned tables and on any unexpected error.

ingestion/tests/unit/topology/database/test_hive.py
- Add TestHivePartitionKeyFlag with 8 unit tests covering:
  - _get_partition_column_names: no-partition table, single key, multiple keys, and correct exit on the next section header.
  - get_columns: single partition key flagged, non-partitioned table, multiple partition keys, all with correct is_partition_column values.
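The parsing and flagging steps described above could look roughly like the sketch below. It is not the code from this PR: the function names `partition_column_names` and `flag_partition_columns`, the sample rows, and the assumption that each row is a tuple whose first element is the column or section name are all illustrative. It also assumes the `# col_name` header line inside the partition section should be skipped rather than treated as the end of the section.

```python
from typing import Dict, Iterable, List, Optional, Sequence, Set

PARTITION_HEADER = "# Partition Information"
COLUMN_HEADER = "# col_name"  # assumed column-header row inside the section

Row = Sequence[Optional[str]]


def partition_column_names(rows: Iterable[Row]) -> Set[str]:
    """Collect the names listed in the partition section (illustrative)."""
    names: Set[str] = set()
    in_partition_section = False
    for row in rows:
        first = (row[0] or "").strip() if row else ""
        if not in_partition_section:
            # State 1: scan forward until the partition header appears.
            in_partition_section = first == PARTITION_HEADER
            continue
        # State 2: inside the partition section.
        if not first or first.startswith(COLUMN_HEADER):
            continue  # blank separator or column-header row
        if first.startswith("#"):
            break  # next section header: stop before Owner:/Location: rows
        names.add(first)
    return names


def flag_partition_columns(columns: List[Dict], names: Set[str]) -> List[Dict]:
    """Attach is_partition_column to every column dict (hypothetical shape)."""
    return [{**col, "is_partition_column": col["name"] in names} for col in columns]


# Hypothetical rows shaped like DESCRIBE FORMATTED output.
SAMPLE_ROWS = [
    ("# col_name", "data_type", "comment"),
    ("id", "int", None),
    ("year", "int", None),
    ("", None, None),
    ("# Partition Information", None, None),
    ("# col_name", "data_type", "comment"),
    ("", None, None),
    ("year", "int", None),
    ("", None, None),
    ("# Detailed Table Information", None, None),
    ("Owner:", "hive", None),
]

names = partition_column_names(SAMPLE_ROWS)
columns = flag_partition_columns([{"name": "id"}, {"name": "year"}], names)
print(names, columns)
```

Breaking on the first `#`-prefixed row that is not the column header keeps the parser from reading key/value rows such as `Owner:` as column names, which is the false-positive case the description calls out.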
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Let us know if you need any help!
    # Sentinel header names that appear in DESCRIBE FORMATTED output; these rows
    # are metadata rows, not real columns.
    _DESCRIBE_SECTION_HEADERS = {"# Partition Information", "# col_name"}
💡 Quality: Unused module-level constant _DESCRIBE_SECTION_HEADERS
The constant _DESCRIBE_SECTION_HEADERS defined at line 27 is never referenced anywhere in the codebase. It appears to be a leftover from an earlier design where section header detection was centralised. Since _get_partition_column_names and get_columns both use inline string literals instead, this constant is dead code.
Suggested fix:
Remove the unused constant:
-_DESCRIBE_SECTION_HEADERS = {"# Partition Information", "# col_name"}
Code Review: 👍 Approved with suggestions (0 resolved / 1 finding)
💡 Quality: Unused module-level constant
Summary
Fixes #26712
The Hive connector mixed partition columns (folder-path segments like `year`, `country`) with regular data columns in the ingested schema, making it impossible to distinguish partition keys from data columns. This made it hard to write efficient queries and understand table organization. The two existing PRs (#27029 and #27278) addressed this but have not landed; this is a clean reimplementation.

Changes

`ingestion/src/metadata/ingestion/source/database/hive/utils.py`
- Added `_get_partition_column_names(rows)`: a strict state-machine parser that reads the `# Partition Information` section of `DESCRIBE FORMATTED` output and returns the set of partition column names. Exits cleanly on the next `#`-prefixed section header (e.g. `# Detailed Table Information`) to avoid false positives from `Owner:` / `Location:` rows.
- `get_columns` now calls `_get_partition_column_names` before the main loop and attaches `is_partition_column=True` to every column that appears in the partition section. Non-partitioned tables get `is_partition_column=False` on all columns, so there is no regression to existing behaviour.

`ingestion/src/metadata/ingestion/source/database/hive/metadata.py`
- Added `get_table_partition_details(table_name, schema_name, inspector)` to `HiveSource`. It reads the `is_partition_column` flag produced by `get_columns` and builds a `TablePartition` with `PartitionColumnDetails` for each partition key, using `PartitionIntervalTypes.COLUMN_VALUE` (Hive partitions are value-based folder segments, not time/integer intervals).
- Returns `(False, None)` gracefully for non-partitioned tables and on any unexpected error, so ingestion never breaks for users without partitioned tables.

`ingestion/tests/unit/topology/database/test_hive.py`
- Added `TestHivePartitionKeyFlag` with 8 unit tests covering:
  - `_get_partition_column_names`: no-partition table, single key, multiple keys, correct exit on the next section header.
  - `get_columns`: single partition key flagged, non-partitioned table, multiple partition keys, all with correct `is_partition_column` values.

How to test

    cd ingestion
    python -m pytest tests/unit/topology/database/test_hive.py::TestHivePartitionKeyFlag -v

All 8 new tests pass alongside the existing Hive test suite.
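For readers who want a feel for the `(False, None)` contract of `get_table_partition_details`, here is a minimal sketch under stated assumptions. `PartitionColumnSketch` and `TablePartitionSketch` are hypothetical stand-ins, not the OpenMetadata `PartitionColumnDetails` / `TablePartition` models, and the plain string `"COLUMN_VALUE"` only mirrors the enum member name mentioned in the description. The real method also takes `table_name`, `schema_name` and `inspector`; this sketch starts from the already-flagged column list.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class PartitionColumnSketch:
    # Hypothetical stand-in for one partition key entry.
    column_name: str
    interval_type: str = "COLUMN_VALUE"  # mirrors the enum member name only


@dataclass
class TablePartitionSketch:
    # Hypothetical stand-in for the table-level partition description.
    columns: List[PartitionColumnSketch] = field(default_factory=list)


def partition_details(
    columns: List[Dict[str, Any]],
) -> Tuple[bool, Optional[TablePartitionSketch]]:
    """Return (True, details) for partitioned tables, else (False, None)."""
    try:
        keys = [c["name"] for c in columns if c.get("is_partition_column")]
    except Exception:
        # Any unexpected input shape degrades to "not partitioned".
        return False, None
    if not keys:
        return False, None
    return True, TablePartitionSketch([PartitionColumnSketch(k) for k in keys])


flagged = [
    {"name": "id", "is_partition_column": False},
    {"name": "year", "is_partition_column": True},
]
print(partition_details(flagged))
print(partition_details([{"name": "id"}]))
```

Returning a tuple rather than raising keeps the caller's control flow identical for partitioned and non-partitioned tables, which matches the stated goal that ingestion never breaks for users without partitioned tables.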