Conversation

@V-FEXrt (Collaborator) commented Nov 7, 2025

Fixes #7846

Implements:

  • the HLSL type __builtin_la_MatrixRef
  • the HLSL intrinsic __builtin_la_CreateMatrix()
  • the DXIL type %dx.types.MatrixRef
  • the DXIL op @dx.op.createMatrix(i32 312)

Note: The DXIL op codes will change in response to microsoft/hlsl-specs#698

@@ -0,0 +1,8 @@
// RUN: %dxc -T cs_6_9 -E main %s -verify
Collaborator Author (@V-FEXrt):

Pretty sure I also want a .ll test, but I don't know the magic incantation to generate the IR up to, but right before, the dxilgen pass.

@pow2clk (Collaborator) Nov 24, 2025:

./utils/hct/ExtractIRForPassTest.py -p dxilgen -o output.ll input.hlsl -- -T cs_6_9 -E main

@damyanp (Member) left a comment:

I've read through this and didn't spot any obvious typos.

Do we need to apply the same "experimental DXIL" stuff to this that we're talking about for other SM 6.10 features?

Comment on lines +140 to 144
LICOMPTYPE_VK_BUFFER_POINTER = 55,
LICOMPTYPE_COUNT = 56
#else
- LICOMPTYPE_COUNT = 54
+ LICOMPTYPE_COUNT = 55
#endif
Member (@damyanp):

Does it matter that these values have changed?

I've no reason to think it does, just checking.

Collaborator Author (@V-FEXrt):

I'm assuming not, but maybe someone else can weigh in. The ifdef basically forces them to change, though, since the last value is only incremented for SPIR-V. Beyond that, this seems to be an internal enum, so changing it shouldn't be observable to end users.

Contributor (@tex3d):

These are mostly internal details, but the built-in intrinsic extension mechanism relies on this header (and this enum) as well. That's currently only used by the Xbox backend, and wouldn't be impacted by these changes unless updated to depend on them, and even then, using the latest header would keep it in sync.

Ideally, I think we should get rid of the ifdef here (so LICOMPTYPE_VK_BUFFER_POINTER is always part of the enumeration), and update the corresponding table g_LegalIntrinsicCompTypes in SemaHLSL.cpp. But that should be a separate change.

Collaborator:

A lot of the ifdef SPIRV conditionals were removed in similar areas; this one was possibly missed as an oversight.

@V-FEXrt (Collaborator Author) commented Nov 7, 2025

> Do we need to apply the same "experimental DXIL" stuff to this that we're talking about for other SM 6.10 features?

Yep

@pow2clk (Collaborator) commented Nov 24, 2025

In a discussion about a month ago, we agreed that createMatrix as written wasn't useful, because it has no way of uniquely identifying the matrix in question, so any and all createMatrix calls will get merged as duplicates. On that basis we agreed to remove it, which would make the builtin unnecessary.

In my own further investigation, which I've been trying to find the opportunity to discuss for weeks, I've determined that we may need a createMatrix that is generated for local or global allocas, but not for parameters: without global variables, we have to use the alloca as the unique identifier, and because these are opaque handles it is difficult to cleanly merge duplicate parameter allocas through inlining and subsequent passes.

In any event, a createMatrix call that only takes the opcode isn't of much use.
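
To make that concrete, here is a rough HLSL-level sketch of the failure mode; the template arguments are hypothetical placeholders (not the final API), and the op number is the one from this PR:

// Two logically distinct accumulators. With a createMatrix that takes only its
// opcode, each declaration lowers to the same call:
//   call %dx.types.MatrixRef @dx.op.createMatrix(i32 312)
// Because the two calls are indistinguishable, CSE/GVN-style deduplication is
// free to fold them into a single MatrixRef, and A and B silently become the
// same matrix.
dx::linalg::Matrix<ComponentType::F16, 16, 16,
                   MatrixUse::Accumulator, MatrixScope::Thread> A;
dx::linalg::Matrix<ComponentType::F16, 16, 16,
                   MatrixUse::Accumulator, MatrixScope::Thread> B;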

@pow2clk (Collaborator) commented Dec 4, 2025

Per our discussion, you can disregard my last comment. However, I just remembered that when I proposed builtins for createMatrix and annotateMatrix, @llvm-beanz rejected the idea on the grounds that it would allow the header implementation, or any other approach using the builtins, to generate invalid DXIL. I found that a compelling argument. I don't know if he has changed his mind since.

Similar create DXIL ops are generated as part of clang CodeGen. As an example I wouldn't strongly recommend following exactly, but which shows the broad idea, see how resource handle creation is generated:

Value *Handle = CreateHandleFromResPtr(arg, HLM, HandleTy, Builder);

The way node handle creation is generated is how I would (and have) approached it, though it would be through local/global variables instead of entry function parameters:

TranslateInputNodeRecordArgToHandle(HLM, NodeInputRecordParams);
TranslateNodeOutputParamToHandle(HLM, NodeOutputParams);

@llvm-beanz (Collaborator):

> Per our discussion, you can disregard my last comment. However, I just remembered that when I proposed builtins for createMatrix and annotateMatrix, @llvm-beanz rejected the idea on the grounds that it would allow the header implementation, or any other approach using the builtins, to generate invalid DXIL. I found that a compelling argument. I don't know if he has changed his mind since.

I think a builtin for createMatrix is fine, and probably necessary. The exact shape it takes might need to shift because we may want to have a single builtin function that generates both a create and an annotate, but what is in this PR is a fine first step.

The annotate builtin is a bit more complicated, but it isn't "creating invalid DXIL" that concerns me so much as whether the annotate builtin could be used to create valid DXIL at all. I'm not strongly opposed to either of these additions; a lot of it just depends on the implementation details as we progress.

@pow2clk (Collaborator) commented Dec 4, 2025

> The annotate builtin is a bit more complicated, but it isn't "creating invalid DXIL" that concerns me so much as whether the annotate builtin could be used to create valid DXIL at all. I'm not strongly opposed to either of these additions; a lot of it just depends on the implementation details as we progress.

Are you saying that create will have a builtin, but annotate will not?

As a sanity check, where do you see the create builtin being called in the linalg.h header?

@V-FEXrt (Collaborator Author) commented Dec 5, 2025

> Are you saying that create will have a builtin, but annotate will not?

Currently the plan is that they both have builtins, and the annotate is just over-emitted and then cleaned up/deduplicated in a late pass.

> As a sanity check, where do you see the create builtin being called in the linalg.h header?

The create and initial annotate are in the constructor of the Matrix type.

@pow2clk (Collaborator) commented Dec 8, 2025

Using the implicit constructor for cases where the matrix is just declared and not initialized makes sense. I requested/suggested further explanation of expected behavior for uninitialized matrices in microsoft/hlsl-specs#691. I haven't heard back about that issue. I considered that a default splat of zeros or something would give a reasonable outcome. For what I've taken to calling the "matrix population functions" (loads, splat, cast, and outerproduct), were you still thinking of relying on the constructor or calling createMatrix and annotateMatrix directly there? If not, I don't see how either builtin would ever be explicitly called from HLSL.

The constructor presents a complication if we allow these to be declared in global scope. I said in microsoft/hlsl-specs#695 that I assume that cbuffer is out, but statics might be in. I haven't heard back about that issue. Is the intent to allow statics?

@llvm-beanz said that annotate is more complicated and it is. It may have some relation to the createMatrix approach. I presume you meant you were going to call it from the functions taking a matrix parameter, but if the user creates a function that takes a dx::linalg::Matrix parameter, there will need to be an annotate on that. How were you thinking about inserting that annotate?

Overall, this sounds like my rejected proposal of two months ago. I'm not against it, I'm just trying to see if that's the case and whether there are other details that need considering before committing to it.

@pow2clk (Collaborator) left a comment:

Just pointing out a potential pitfall for annotateMatrix. If you've started on that work and already have a solution, I'd be interested to know how it works without templated type information.

CXXRecordDecl *hlsl::DeclareMatrixRefType(ASTContext &Context) {
  // MatrixRef { ... }
  BuiltinTypeDeclBuilder TypeDeclBuilder(Context.getTranslationUnitDecl(),
                                         "__builtin_la_MatrixRef");
Collaborator (@pow2clk):

I overlooked this before: without template arguments, I don't know where you will get the property information that the matrix provides, as it is stored solely in the header-defined type that contains this type.

Collaborator (@pow2clk):

Though I guess this relates to my previous question of whether annotateMatrix is expected to be called explicitly from HLSL. In that case, the type information from the header would be available, but that's not available for user function parameters.

Collaborator Author (@V-FEXrt):

The property information would be tracked by the templated header class and forwarded along to annotate as needed.

Collaborator (@pow2clk):

Do you have that working in a branch somewhere? I just don't see how that information would be available to clang in a user function needing an annotate without adding a dependency at the compiler implementation level on a class written in an HLSL header, which seems like a bad idea.

@V-FEXrt (Collaborator Author) Dec 8, 2025:

I don't, no. I was planning on getting deep into things once I got back, but this whole discussion preempted that :)

I'm not sure where you see a potential compiler dependency on the class, though. This works (although optimization seems to be deleting everything) without any special knowledge from the compiler, and it is pretty much a strawman of what I plan to do: https://godbolt.org/z/dnYfn7eTv

Collaborator (@pow2clk):

Thanks for that. That's what I suspected. That works fine for methods of the matrix class defined in the header, but what about user functions that pass matrices around? They shouldn't pass the MatrixRef around, but there shouldn't be any reason they can't pass the dx::linalg::Matrix class.
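
Concretely, the sort of user code I have in mind looks something like this (a hedged sketch; the template arguments and the MatrixScope value are hypothetical placeholders, not the final API):

// A user-written helper that passes the header's Matrix type by value. Any
// annotate needed for Acc/A/B has to be produced inside this function (or its
// callees), since the create in the caller isn't visible across the call
// boundary.
void UserAccumulate(
    inout dx::linalg::Matrix<ComponentType::F16, 16, 16,
                             MatrixUse::Accumulator, MatrixScope::Thread> Acc,
    dx::linalg::Matrix<ComponentType::F16, 16, 16,
                       MatrixUse::A, MatrixScope::Thread> A,
    dx::linalg::Matrix<ComponentType::F16, 16, 16,
                       MatrixUse::B, MatrixScope::Thread> B) {
  Acc.MultiplyAccumulate(A, B);
}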

Collaborator Author (@V-FEXrt):

Do you have specific user functions in mind? The whole point of dx::linalg::Matrix is to restrict matrix usage down to only the allowed operations. Anything that a user would want to do with the MatrixRef should be accessible via the header API.

void [[min_sm=6.10]] __builtin_VectorAccumulate(in LinAlg<c> InputVector, in RWByteAddressBuffer MatrixBuffer, in uint MatrixOffset);


MatrixRef [[min_sm=6.10]] __builtin_la_CreateMatrix();
Collaborator (@pow2clk):

Related to my previous comment: if MatrixRef has template parameters, then it can't be uniquely determined as a return type. It can be done with an output variable instead.

Collaborator Author (@V-FEXrt):

The intention is for MatrixRef to not have template parameters.

@tex3d (Contributor) commented Dec 8, 2025

While the current plan (as I discussed with @V-FEXrt) is to add the annotate call to every method that consumes or produces a matrix, I'm not so sure the memory semantics for the linalg::Matrix type are entirely worked out in a consistent way.

I think the crux of the problem is that in HLSL the object looks like it is the memory for the matrix, or at least automatically manages the memory. In DXIL, this translates to a dx.types.MatrixRef, which just refers to a matrix. Presumably the MatrixRef value in DXIL does not represent the actual memory allocated, but the dx.op.createMatrix operation essentially allocates the matrix referred to by the MatrixRef. So, how do we track allocation lifetime? What happens when you assign one linalg::Matrix to another, or pass it through a function parameter (which copies it)? What rules guarantee the driver can track lifetimes and eliminate copies or re-use memory?

For instance, depending on the linalg::MatrixScope, the linalg::Matrix could refer to a per-thread matrix loaded from global memory, which is meant to be folded into an operation through optimization without actually allocating any local memory. Or it could refer to local per-thread fragments making up a wave-scope matrix, or to groupshared memory, which isn't counted in our normal groupshared memory accounting for a shader.
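
To make those questions concrete, a hedged sketch (hypothetical template arguments, not the final API):

void Consume(dx::linalg::Matrix<ComponentType::F16, 16, 16,
                                MatrixUse::A, MatrixScope::Thread> M);

void Example() {
  dx::linalg::Matrix<ComponentType::F16, 16, 16,
                     MatrixUse::A, MatrixScope::Thread> A;     // createMatrix: what storage does the ref denote, and who owns it?
  dx::linalg::Matrix<ComponentType::F16, 16, 16,
                     MatrixUse::A, MatrixScope::Thread> B = A; // a copy of the matrix data, or a second ref to the same matrix?
  Consume(A);   // by-value parameter: another allocation, or an aliased ref?
}               // when may the driver reclaim or reuse the storage behind A and B?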

@pow2clk (Collaborator) commented Dec 8, 2025

> While the current plan (as I discussed with @V-FEXrt) is to add the annotate call to every method that consumes or produces a matrix, I'm not so sure the memory semantics for the linalg::Matrix type are entirely worked out in a consistent way.
>
> I think the crux of the problem is that in HLSL the object looks like it is the memory for the matrix, or at least automatically manages the memory. In DXIL, this translates to a dx.types.MatrixRef, which just refers to a matrix. Presumably the MatrixRef value in DXIL does not represent the actual memory allocated, but the dx.op.createMatrix operation essentially allocates the matrix referred to by the MatrixRef. So, how do we track allocation lifetime? What happens when you assign one linalg::Matrix to another, or pass it through a function parameter (which copies it)? What rules guarantee the driver can track lifetimes and eliminate copies or re-use memory?
>
> For instance, depending on the linalg::MatrixScope, the linalg::Matrix could refer to a per-thread matrix loaded from global memory, which is meant to be folded into an operation through optimization without actually allocating any local memory. Or it could refer to local per-thread fragments making up a wave-scope matrix, or to groupshared memory, which isn't counted in our normal groupshared memory accounting for a shader.

These are interesting issues. At least in the case of cast, I expected that the output matrixref was distinct from the one passed in. I was expecting that we'd follow the intuition of resources in which case the handle does reference the memory, but this is different in that the operations don't just sample from that memory, but completely populate or dump it to memory. The decisions made here might affect whether a default splat for uninitialized matrices like I described above will be possible or not, because they determine whether the compiler can identify that the splat isn't necessary if the matrix is subsequently assigned. Here I think the recently added hitobjects are illustrative as they default to a noop assignment, but can be eliminated if it is subsequently assigned to something that overrides it.

Interesting as all that is, can you help me understand how it relates to the question at hand? Regardless of whether they refer to the memory, I would think the matrixref would be consigned to the type that its parent class was declared as. Perhaps it was in reference to something else I said?

Are you saying that user functions might not need annotates for their parameters or that there is another way to solve that problem?

@tex3d (Contributor) commented Dec 9, 2025

I opened this issue microsoft/hlsl-specs#756 against the spec, because I think the root issue is a lack of clear memory semantics/lifetime/ownership definition in the spec for the HLSL object and the matrix data it references.

@tex3d (Contributor) commented Dec 9, 2025

> These are interesting issues. At least in the case of cast, I expected that the output matrixref was distinct from the one passed in.

The explicit operations aren't the problem, since they would simply take these handles that refer to some memory managed elsewhere. The problem is the management of that memory, and how the HLSL code translates to the correct number of objects with appropriate lifetimes that the driver can reason about in all cases.

> I was expecting that we'd follow the intuition of resources in which case the handle does reference the memory

The key difference from resources is that a resource handle references something externally managed, rather than something managed by the shader itself.

> Here I think the recently added hitobjects are illustrative as they default to a noop assignment, but can be eliminated if it is subsequently assigned to something that overrides it.

HitObject is well-defined because the object is immutable, and the storage requirement is fixed (though device dependent) and local. The driver compiler can replace dx.types.HitObject with the internal device-dependent type, and everything is correct. The noop assignment just gives the object a well-defined initial value.

> Interesting as all that is, can you help me understand how it relates to the question at hand? Regardless of whether they refer to the memory, I would think the matrixref would be consigned to the type that its parent class was declared as. Perhaps it was in reference to something else I said?
>
> Are you saying that user functions might not need annotates for their parameters or that there is another way to solve that problem?

No, the plan was to annotate on input and on create (constructor would create+annotate) for every operation. Annotate on input includes this and is necessary for when the call is in another function, or there's some other distance between the initial create+allocate and the use. This ensures that the operation has preserved the local type information in all cases, even if there's an assignment through an alloca (even a dynamic one). Every operand of a local DXIL operation would have come from an annotate handle which describes the type information necessary to interpret the operation.

To illustrate, there would be something like a meant-for-internal-use _getAnnotatedHandle() method on the linalg::Matrix class, which is used by this class to ensure preservation of type information:

namespace linalg {
template <...>
class Matrix {
  // Meant to be private, but HLSL doesn't support private:
  MatrixRef _Matrix;
  MatrixRef _getAnnotatedHandle() {
    return __builtin_la_AnnotateMatrix(_Matrix, <matrix properties...>);
  }

  // Public part of the interface:
  Matrix() {
    _Matrix = __builtin_la_CreateMatrix();
    _Matrix = _getAnnotatedHandle();
  }
  
  template <ComponentType NewCompTy, MatrixUse NewUse = Use>
  Matrix<NewCompTy, M, N, NewUse, Scope> Cast() {
    Matrix<NewCompTy, M, N, NewUse, Scope> Dest; // will create+annotate
    // Need to annotate our own handle locally for the operation:
    __builtin_la_CastMatrix(Dest._Matrix, _getAnnotatedHandle());
    return Dest;
  }

  template <ComponentType LHSTy, ComponentType RHSTy, uint K,
            MatrixUse UseLocal = Use>
  typename hlsl::enable_if<Use == MatrixUse::Accumulator && UseLocal == Use,
                           void>::type
  MultiplyAccumulate(const Matrix<LHSTy, M, K, MatrixUse::A, Scope> A,
                     const Matrix<RHSTy, K, N, MatrixUse::B, Scope> B) {
    __builtin_la_MultiplyAccumulateMatrix(_getAnnotatedHandle(),
                                          A._getAnnotatedHandle(),
                                          B._getAnnotatedHandle());
  }

  ...
};
}

This doesn't solve the memory management problem I'm referring to though.

@tex3d (Contributor) left a comment:

This needs to be updated for the experimental DXIL op table changes. However, I had some concerns about the naming of the HLSL intrinsic and DXIL op, and I think other questions may get in the way, related to how the current approach will solve the memory management of the matrix contents of any given HLSL Matrix object.

)

self.add_dxil_op(
"CreateMatrix",
Contributor (@tex3d):

CreateMatrix seems overly generic. What about CreateLinAlgMatrix or CreateMatrixRef?

I'm a bit concerned with this just being "Matrix" when we have a different concept of matrix in HLSL. With vectors now preserved in IR, I could even imagine doing matrix operations on those vectors instead of scalarizing them, and needing to differentiate that sort of operation from the more abstract LinAlg MatrixRef.

This has me concerned with the other DXIL op names too, which seem overly generic for how specific they actually are, targeting a specialized hardware feature.

@V-FEXrt (Collaborator Author) Dec 10, 2025:

I'm happy to update this, but it's important to note that it would be a spec change and needs to be updated over there first.

https://github.com/microsoft/hlsl-specs/blob/main/proposals/0035-linalg-matrix.md#dxil-operations

)
for i in (
"MatVecMul,MatVecMulAdd,OuterProductAccumulate,VectorAccumulate"
"MatVecMul,MatVecMulAdd,OuterProductAccumulate,VectorAccumulate,CreateMatrix"
Contributor (@tex3d):

After rebase, move the new one to the end of populate_ExperimentalOps:

def populate_ExperimentalOps(self):

% next_op_idx
)

self.add_dxil_op(
Contributor (@tex3d):

After rebase, move this to the end of populate_categories_and_models_ExperimentalOps:

def populate_categories_and_models_ExperimentalOps(self):

void [[min_sm=6.10]] __builtin_VectorAccumulate(in LinAlg<c> InputVector, in RWByteAddressBuffer MatrixBuffer, in uint MatrixOffset);


MatrixRef [[min_sm=6.10]] __builtin_la_CreateMatrix();
Contributor (@tex3d):

I don't think this extra level of abbreviation (la) is helpful. Can we just use linalg instead?


Successfully merging this pull request may close these issues.

Enable creation of a LinAlg Matrix (type, intrinsic, dxil op)
