Skip to content

Conversation

@ingomueller-net
Copy link
Collaborator

@ingomueller-net ingomueller-net commented Feb 4, 2025

Once rebased onto #87, this PR will need llvm/llvm-project#127518 to make pyright happy.

This PR introduces a new RelationType as the operands and result types of RelOps, where previously tuple had been used as a placeholder. While this changes little in the structure of the IR, it (1) underlines the subtle difference between the relation, a container of records, and the records themselves, which had both been represented as tuples before and (2) it allows defining an even shorter type name, rel, for that type. The latter is implemented using a custom printer and parser for these "pseudo type names" (which are really keywords for the printer and parser), which lays the basis for similar short-hand forms for some of our other types with long type names.

@ingomueller-net ingomueller-net force-pushed the relation-type branch 3 times, most recently from d5e458e to 6939882 Compare February 11, 2025 13:29
@ingomueller-net
Copy link
Collaborator Author

ingomueller-net commented Feb 11, 2025

I am still unsure about what assembly format is best. We have the following possibilities:

%0 = named_table @t1 as ["a"] : <si32>  // no typename by default
%0 = named_table @t1 as ["a"] : !substrait.relation<si32>  // always fully qualified name
%0 = named_table @t1 as ["a"] : relation<si32>  // spelled-out pseudo-typename
%0 = named_table @t1 as ["a"] : rel<si32>  // short pseudo-typename

If no type name is the default, we have to rely on the reader's knowledge that the ops always return relations. If we always print the fully qualified name, some ops get quite verbose (set ops can have several operands plus the result, which would all repeat that it's a relation). The pseudo-typenames would really be a keyword, which (1) would need to be defined for each op and (2) wouldn't color as a type in IDEs. Any opinions, @jpienaar or @dshaaban01?

ingomueller-net added a commit to ingomueller-net/substrait-mlir-contrib that referenced this pull request Feb 19, 2025
This PR updates the LLVM submodule to llvm/llvm-project@0de2ccab7b, the
latest version as of today. That version contains a fix for `nanobind`'s
`stubgen` (in llvm/llvm-project#127518), which we require for the typing
stubs for substrait-io#78.

Signed-off-by: Ingo Müller <[email protected]>
ingomueller-net added a commit that referenced this pull request Feb 20, 2025
…88)

This PR updates the LLVM submodule to llvm/llvm-project@0de2ccab7b, the
latest version as of today. That version contains a fix for `nanobind`'s
`stubgen` (in llvm/llvm-project#127518), which we require for the typing
stubs for #78.

This PR requires to update the git modules in existing git clones:

```bash
git submodule update --recursive
```

Signed-off-by: Ingo Müller <[email protected]>
@ingomueller-net
Copy link
Collaborator Author

@mortbopet: Any opinion from your side?

ingomueller-net added a commit to ingomueller-net/substrait-mlir-contrib that referenced this pull request Apr 24, 2025
This includes wjakob/nanobind#939 and wjakob/nanobind#940, which fixes
an issue I encountered while working on substrait-io#78, so we need the new version
of `nanobind` for CI to pass and stub generation to be correct for that.

Signed-off-by: Ingo Müller <[email protected]>
@mortbopet
Copy link
Collaborator

Well, first, i support adding a relation type to get out of the whole aliasing-with-tuples-in-IR issue. But it is a shame that we can't re-use the tuple type as an internal storage mechanism, and have to copy the type storage struct - which appears to me as non-trivial.
Is it not possible to have an implementation that keeps the custom printer/parser, but has a TupleType as parameter? That should take care of inferring a type storage, but we'd then still provide the custom builders for API neatness.

Re. assembly format, given the reasons that you highlight, i'd be in favor of the first (no typename). This would also be in line with substrait operations generally having elided their dialect name (getDialectNamespace). But mainly due to the size of the IR; i've already experienced IR bloat through substrait.decimal types (also a long name!) so adding in even more text would clearly worsen this.

ingomueller-net added a commit that referenced this pull request May 8, 2025
This includes wjakob/nanobind#939 and wjakob/nanobind#940, which fixes
an issue I encountered while working on #78, so we need the new version
of `nanobind` for CI to pass and stub generation to be correct for that.

Signed-off-by: Ingo Müller <[email protected]>
@ingomueller-net
Copy link
Collaborator Author

Is it not possible to have an implementation that keeps the custom printer/parser, but has a TupleType as parameter? That should take care of inferring a type storage, but we'd then still provide the custom builders for API neatness.

Turns out it is and I think it's a good idea. See the latest commit I just pushed. The change is so small that no usage of the op had to changed, so the fact that a TupleType is used under the hood is barely visible :)

Re. assembly format, given the reasons that you highlight, i'd be in favor of the first (no typename). This would also be in line with substrait operations generally having elided their dialect name (getDialectNamespace). But mainly due to the size of the IR; i've already experienced IR bloat through substrait.decimal types (also a long name!) so adding in even more text would clearly worsen this.

I agree with the goal to avoid IR bloat. How about the last option, though? It only adds three characters per type and may avoid a lot of confusion, in particular, for new users. The biggest downside I see is that it is not a real type name but the only practical implication of that that I see is a different color in IDEs, which is probably acceptable. WDYT?

@ingomueller-net
Copy link
Collaborator Author

I though a bit more about the "only a pseudo-type" issue. I think it can be acceptable, and I think it'll provide a better trade-off in other situations as well, namely to reduce the redundancy of nested types, including the field types of !substrait.relation. We have discussed this here for the list type. Concretely, I am thinking to use pseudo-types similar to how the LLVM dialect handles this upstream here:

static StringRef getTypeKeyword(Type type) {
  return TypeSwitch<Type, StringRef>(type)
      .Case<LLVMVoidType>([&](Type) { return "void"; })
      .Case<LLVMPPCFP128Type>([&](Type) { return "ppc_fp128"; })
      .Case<LLVMTokenType>([&](Type) { return "token"; })
     // ...
      .Default([](Type) -> StringRef {
        llvm_unreachable("unexpected 'llvm' type kind");
      });
}

void mlir::LLVM::detail::printType(Type type, AsmPrinter &printer) {
  // ....
  printer << getTypeKeyword(type);
  llvm::TypeSwitch<Type>(type)
      .Case<LLVMPointerType, LLVMArrayType, LLVMFunctionType, LLVMTargetExtType,
            LLVMStructType>([&](auto type) { type.print(printer); });
}

static Type dispatchParse(AsmParser &parser, bool allowAny = true) {
  // ...
  return StringSwitch<function_ref<Type()>>(key)
      .Case("void", [&] { return LLVMVoidType::get(ctx); })
      .Case("ppc_fp128", [&] { return LLVMPPCFP128Type::get(ctx); })
      .Case("token", [&] { return LLVMTokenType::get(ctx); })
      // ...
      .Default([&] {
        parser.emitError(keyLoc) << "unknown LLVM type: " << key;
        return Type();
      })();
}

That could lead to an assembly format like this:

substrait.plan version 0 : 42 : 1 {
  relation {
    %0 = named_table @t1 as ["a"] : rel<timestamp_tz>
    %1 = named_table @t2 as ["b"] : rel<timestamp_tz>
    %2 = join inner %0, %1 : rel<timestamp_tz>, rel<timestamp_tz> -> rel<timestamp_tz, timestamp_tz>
    yield %2 : rel<timestamp_tz>
  }
}

Would that be an acceptable trade-off?

@ingomueller-net
Copy link
Collaborator Author

I though a bit more about the "only a pseudo-type" issue.

I implemented part of this idea in the last commit of this PR and #131 stacked on top of it in order to see betterhow this could look like. @dshaaban01, @mortbopet: what are your impressions?

Copy link
Collaborator

@mortbopet mortbopet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ingomueller-net sorry for the delay. Glad to see that there was already a version upstream for handling this kind of scenario. I'm on-board with the shorthand version (#131).

return TypeSwitch<Type, StringRef>(type)
.Case<RelationType>([&](Type) { return "rel"; })
.Default([](Type) -> StringRef {
llvm_unreachable("unexpected 'llvm' type kind");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is llvm intended here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! No, not intended. Will fix in the next iteration.

@ingomueller-net
Copy link
Collaborator Author

@ingomueller-net sorry for the delay. Glad to see that there was already a version upstream for handling this kind of scenario. I'm on-board with the shorthand version (#131).

OK, cool, thanks for the feedback and no worries about delays -- I am have quite a few distractions these days myself... OK, will try to finish this iteration at some point next week.

@ingomueller-net ingomueller-net changed the title feat: implement real relation type [WIP] feat: implement real relation type Jun 10, 2025
This commit introduces a new `RelationType` as the operands and result
types of `RelOp`s, where previously `tuple` had been used as a
placeholder.

Signed-off-by: Ingo Müller <[email protected]>
This required extending and fixing the custom printer and parser to (1)
multiple types and (2) non-Substrait types.
@ingomueller-net
Copy link
Collaborator Author

@mortbopet: I finally got around to wrapping up this PR. Do you want to take another look before I merge?

@ingomueller-net ingomueller-net merged commit 763519d into substrait-io:main Jun 11, 2025
11 checks passed
@ingomueller-net ingomueller-net deleted the relation-type branch June 11, 2025 18:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants