@EpsilonPrime
Contributor

Moved the output schema, which exists to work around the ClickHouse output schema nullability mismatch, out of the core Substrait definition and into an extension. This brings the Gluten fork of Substrait closer to the upstream project.

- Removed output_schema field from RelRoot in algebra.proto
- Added advanced_extension field to RelRoot
- Created gluten_extensions.proto with RelRootOutputSchema extension
- Updated ClickHouse parser to read output_schema from extension
- Updated Java PlanNode to pack output_schema into extension
- Updated documentation to reflect the change
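
The pack and unpack round trip in the list above can be sketched roughly as follows. This is a simplified Python model, not the actual Gluten code: `Any`, `AdvancedExtension`, and `RelRoot` are plain dataclass stand-ins for the protobuf messages, the use of the `enhancement` slot and the type URL are assumptions, and the byte encoding of the schema is invented purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed type URL (default Any prefix + package + message name); not taken from the PR.
OUTPUT_SCHEMA_TYPE_URL = "type.googleapis.com/substrait.extensions.RelRootOutputSchema"

Column = Tuple[str, bool]  # (type name, nullable)


@dataclass
class Any:
    """Stand-in for google.protobuf.Any: a type URL plus an opaque payload."""
    type_url: str
    value: bytes


@dataclass
class AdvancedExtension:
    """Stand-in for the Substrait extension holder (only one slot modelled)."""
    enhancement: Optional[Any] = None


@dataclass
class RelRoot:
    """Stand-in for RelRoot after the change: no output_schema field."""
    names: List[str]
    advanced_extension: Optional[AdvancedExtension] = None


def encode_schema(columns: List[Column]) -> bytes:
    # Toy encoding for illustration; the real payload is a serialized proto message.
    return ";".join(f"{t}:{'1' if n else '0'}" for t, n in columns).encode("utf-8")


def decode_schema(payload: bytes) -> List[Column]:
    if not payload:
        return []
    columns = []
    for item in payload.decode("utf-8").split(";"):
        type_name, flag = item.rsplit(":", 1)
        columns.append((type_name, flag == "1"))
    return columns


def pack_output_schema(root: RelRoot, columns: List[Column]) -> None:
    """Producer side: wrap the schema and attach it as an extension."""
    payload = Any(OUTPUT_SCHEMA_TYPE_URL, encode_schema(columns))
    root.advanced_extension = AdvancedExtension(enhancement=payload)


def unpack_output_schema(root: RelRoot) -> Optional[List[Column]]:
    """Consumer side: return the schema, or None if the extension is absent."""
    ext = root.advanced_extension
    if ext is None or ext.enhancement is None:
        return None
    if ext.enhancement.type_url != OUTPUT_SCHEMA_TYPE_URL:
        return None  # some other extension; ignore it
    return decode_schema(ext.enhancement.value)
```

Checking the type URL before decoding means a consumer that does not recognize the extension can skip it, which is the main reason to carry backend-specific data this way rather than in a core field.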

The output_schema field was only used by the ClickHouse backend to
preserve nullability information. By moving it to an extension, we
follow Substrait best practices for backend-specific functionality.

- Moved proto file from substrait/proto/substrait/extensions/ to io/glutenproject/proto/
- Changed package from 'substrait.extensions' to 'gluten'
- Updated Java package to 'io.glutenproject.proto'
- Updated C++ includes from substrait/extensions/gluten_extensions.pb.h to gluten_extensions.pb.h
- Updated namespace references from substrait::extensions to gluten in C++ code
- Updated Java imports to use io.glutenproject.proto.GlutenExtensions

This keeps Gluten-specific extensions separate from upstream Substrait
proto files and follows the existing Gluten proto conventions.

Renamed RelRootOutputSchema to CHExpectedOutputSchema and improved
documentation to make it clear this is a ClickHouse-specific workaround,
not a general output schema mechanism.

Key clarifications:
- The output schema can already be computed from any Substrait plan
- This extension exists because ClickHouse's plan conversion doesn't
  preserve nullability correctly (issue-1874)
- Rather than fix ClickHouse's complex conversion logic, we provide
  the expected schema so ClickHouse can insert casts to correct
  nullability mismatches
- Not needed by backends like Velox that preserve types correctly
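
The idea of using the expected schema to decide where casts are needed can be sketched as below. This is a hedged Python illustration only: the `Column` type and `plan_nullability_casts` function are hypothetical names, and the choice to raise on a type or column-count mismatch is an assumption, not a description of what the ClickHouse backend does.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Column:
    name: str
    type_name: str
    nullable: bool


def plan_nullability_casts(
    actual: List[Column], expected: List[Column]
) -> List[Tuple[int, bool]]:
    """Return (column position, target nullability) for each column whose
    nullability differs from the expected schema.

    Only nullability is corrected; any other difference is treated as an error
    in this sketch.
    """
    if len(actual) != len(expected):
        raise ValueError(
            f"column count mismatch: {len(actual)} vs {len(expected)}"
        )
    casts = []
    for position, (got, want) in enumerate(zip(actual, expected)):
        if got.type_name != want.type_name:
            raise ValueError(
                f"column {position}: type {got.type_name} != {want.type_name}"
            )
        if got.nullable != want.nullable:
            casts.append((position, want.nullable))
    return casts


# Example: the converted plan marks `id` nullable, but the expected schema does not.
actual = [Column("id", "i64", True), Column("name", "string", True)]
expected = [Column("id", "i64", False), Column("name", "string", True)]
casts = plan_nullability_casts(actual, expected)  # one cast, for column 0
```

A backend that already preserves nullability would always get an empty list here, which matches the point that the extension is unnecessary for such backends.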

Changes:
- Renamed message from RelRootOutputSchema to CHExpectedOutputSchema
- Renamed field from output_schema to expected_schema
- Added detailed comments explaining the workaround nature
- Updated all Java and C++ code references
- Enhanced documentation to explain this is not a general feature
- Improved error messages and comments in C++ code

@github-actions github-actions bot added the CORE (works for Gluten Core), CLICKHOUSE, and DOCS labels on Nov 6, 2025
@github-actions
github-actions bot commented Nov 6, 2025

Run Gluten Clickhouse CI on x86

@EpsilonPrime
Contributor Author

Will solve this by removing output_schema instead.
