Conversation

@mawad-amd (Collaborator) commented Aug 17, 2025

Today if we write:

source_code = """extern "C" {
        void add_constant_1(int* input, int* output, int tile_size) {
            for (int i = 0; i < tile_size; i++) {
                output[i] = input[i] + 1;
            }
        }
        void add_constant_2(int* input, int* output, int tile_size) {
            for (int i = 0; i < tile_size; i++) {
                output[i] = input[i] + 2;
            }
        }
    }"""
# Create ExternalFunction for adding one (first stage)
tile_size = 16
add_one_function = ExternalFunction(
    "add_constant_1",
    source_string=source_code,
    arg_types=[
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.int32,
    ],
)
# Create ExternalFunction for adding two (second stage)
add_two_function = ExternalFunction(
    "add_constant_2",
    source_string=source_code,
    arg_types=[
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.int32,
    ],
)
# Apply the transform: input -> add_constant_1 -> add_constant_2 -> output
# This will apply: input + 1 + 2 = input + 3
apply_pipeline_transform(
    input_tensor, output_tensor, add_one_function, add_two_function
)

The two ExternalFunctions will be associated with two different code objects, which doesn't make sense: the source code is the same and the compiler flags are the same, so we should be smart enough to detect that a single code object suffices. This PR adds a cache so we can support multiple kernels from the same code object without much overhead or extra lines of code.
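A minimal sketch of the caching idea, with hypothetical names (compile_once and _cache_key are illustrative, not this PR's actual code): identical source, flags, and include directories hash to the same key, so only one code object gets built.

import hashlib

_code_object_cache: dict = {}  # key -> code object name (illustrative)

def _cache_key(source: str, compile_flags=(), include_dirs=()) -> str:
    h = hashlib.sha256()
    for part in (source, *compile_flags, *include_dirs):
        h.update(part.encode())
    return h.hexdigest()

def compile_once(source: str, compile_flags=(), include_dirs=()) -> str:
    key = _cache_key(source, compile_flags, include_dirs)
    if key not in _code_object_cache:
        # The real flow would invoke the compiler here; the key doubles as the
        # code object name, so identical definitions collide on purpose.
        _code_object_cache[key] = f"{key}.o"
    return _code_object_cache[key]

With this, add_constant_1 and add_constant_2 above resolve to the same cached object file.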

@ypapadop-amd (Collaborator)

Instead of having it as a cache, can we have an explicit object that manages an archive? E.g.,

archive = ExternalArchive(
    file_name="my_archive.o",
    source_string=source_code,
    compile_flags=[
        ...
    ],
    include_dirs = [
        ...
    ]
)

# Create ExternalFunction for adding one (first stage)
tile_size = 16
add_one_function = ExternalFunction(
    "add_constant_1",
    archive=archive,
    arg_types=[
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.int32,
    ],
)
# Create ExternalFunction for adding two (second stage)
add_two_function = ExternalFunction(
    "add_constant_2",
    archive=archive,
    arg_types=[
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.int32,
    ],
)
# Apply the transform: input -> add_constant_1 -> add_constant_2 -> output
# This will apply: input + 1 + 2 = input + 3
apply_pipeline_transform(
    input_tensor, output_tensor, add_one_function, add_two_function
)

For non-trivial examples, we'll have include directories and multiple flags, and duplicating them could lead to unexpected behavior if they fall out of sync.

@mawad-amd (Collaborator, Author)

The include directories and compiler flags are part of the hash that gets computed, which ends up being the code object name. If everything doesn't match exactly, you will get two different code objects, and the Worker class will complain that you are using two code objects. This behavior is well defined here.

@ypapadop-amd (Collaborator)

It's still a less friendly developer experience to have to duplicate flags and include directories for something that should be managed centrally.

It also seems that it forces you to put a file in the kernel directory, which is not ideal for projects that put artifacts in different directories than the source code.

@mawad-amd (Collaborator, Author) commented Aug 18, 2025

> It's still a less friendly developer experience to have to duplicate flags and include directories for something that should be managed centrally.

Sure, this PR just enables current code to work when it should. Eventually we can introduce an "ExternalModule" or "ExternalCodeObject". This PR is not doing this.

> It also seems that it forces you to put a file in the kernel directory, which is not ideal for projects that put artifacts in different directories than the source code.

What do you mean? User source code can come from a string or a file. See tests.

@mawad-amd (Collaborator, Author)

> Instead of having it as a cache, can we have an explicit object that manages an archive? E.g., [see the ExternalArchive example above]
>
> For non-trivial examples, we'll have include directories and multiple flags, and duplicating them could lead to unexpected behavior if they fall out of sync.

Yeah, this is a different PR. Feel free to code it up.

@ypapadop-amd (Collaborator)

> It also seems that it forces you to put a file in the kernel directory, which is not ideal for projects that put artifacts in different directories than the source code.
>
> What do you mean? User source code can come from a string or a file. See tests.

If I understand it correctly, it writes something (code?) into the directory that the kernel is in. Isn't that correct?

@ypapadop-amd (Collaborator) commented Aug 18, 2025

> Yeah, this is a different PR. Feel free to code it up.

Is there a way to disable this support? If not, it will break my workflow and it's not acceptable to break users' workflows.

@mawad-amd (Collaborator, Author)

Old behavior should work fine: if you specify the code object name, you override everything. Please test it and let me know if it breaks your code.

@mawad-amd (Collaborator, Author)

@ypapadop-amd, any update?

@mawad-amd mawad-amd requested a review from jgmelber August 19, 2025 00:30
@ypapadop-amd (Collaborator) commented Aug 19, 2025

> @ypapadop-amd, any update?

It appears that initializing an ExternalFunction expects to find the source file relative to where the initialization happens. We will also pay the cost of reading the source file, even when the key is not useful to a non-JIT flow.

Can we put the cache inside the compile function instead of it being a singleton of the class? I don't see a particular technical reason for it to be there.

@mawad-amd (Collaborator, Author)

> It appears that initializing an ExternalFunction expects to find the source file relative to where the initialization happens. We will also pay the cost of reading the source file, even when the key is not useful to a non-JIT flow.

Ok, I can fix that. Can you share a test for your use case? We can make sure it always works.

> Can we put the cache inside the compile function instead of it being a singleton of the class? I don't see a particular technical reason for it to be there.

We would need a bunch of if not hasattr(ExternalFunction, "_cache") checks. It doesn't look cleaner to me.
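A minimal sketch of the two placements being debated, with hypothetical names:

import functools

# (1) Roughly what this PR does: a cache stored on the class itself.
class ExternalFunction:
    _cache: dict = {}  # shared by all instances, filled in during compilation

# (2) The suggested alternative: keep the cache local to the compilation step,
#     e.g. by memoizing a key-to-object-file function.
@functools.lru_cache(maxsize=None)
def compiled_object_for(key: str) -> str:
    return f"{key}.o"  # the real flow would compile here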

@mawad-amd (Collaborator, Author)

@ypapadop-amd Hopefully d89d954 makes things work now. I would like a test (or a description of one) whenever you have some time.

@ypapadop-amd (Collaborator)

> Ok, I can fix that. Can you share a test for your use case? We can make sure it always works.

It may be difficult to write a test case outside of what I'm doing since it's a different build process. I can try but it's extra churn. This is why I've been suggesting a version of iron.jit that returns artifacts instead of running them; we'd avoid this extra back and forth.

> > Can we put the cache inside the compile function instead of it being a singleton of the class? I don't see a particular technical reason for it to be there.
>
> We would need a bunch of if not hasattr(ExternalFunction, "_cache") checks. It doesn't look cleaner to me.

Does this attribute need to be a member of ExternalFunction? It seems to be applicable only to the compilation process.

I also don't understand why we'd need a number of them: there's a single loop that does the compilation/linking, and once that's done you don't need to revisit it. A function that computes the key on demand and then caches it would also work, as would the original suggestion of an archive class, which would mirror what traditional toolchains do today.

In any case, since I'm using explicit object_file_names it's fine for my use case, but I can see how it can break outside iron.jit.

@mawad-amd (Collaborator, Author)

You are welcome to change the JIT backend to support managing the code objects. As long as the abstraction is the same, that’s fine by me.

The behavior that this PR enables should just work. It doesn't make sense to have the same code and the same compiler flags and end up with two different code objects for JIT. The archive idea is fine but doesn't solve this problem.

@ypapadop-amd (Collaborator)

> You are welcome to change the JIT backend to support managing the code objects. As long as the abstraction is the same, that's fine by me.
>
> The behavior that this PR enables should just work. It doesn't make sense to have the same code and the same compiler flags and end up with two different code objects for JIT. The archive idea is fine but doesn't solve this problem.

I never suggested that you should have different code objects, which is not supported anyway.

I will debate programmability, though. This PR requires duplicating compile options at the creation of each ExternalFunction, even when they don't apply to that particular compiled function, while still requiring a centralized function definition. I cannot think of another system with similar behavior.

I don't think the following is valid today, but it could be if you deferred all caching and related work until compilation:

tile_size = 16
add_one_function = ExternalFunction(
    "add_constant_1",
    source_string="""extern "C" {
        void add_constant_1(T1* input, T1* output, int tile_size) {
            for (int i = 0; i < tile_size; i++) {
                output[i] = input[i] + 1;
            }
        }""",
    arg_types=[
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.int32,
    ],
    compilation_flags = ["-DT1=int"],
)

add_two_function = ExternalFunction(
    "add_constant_2",
    source_string="""extern "C" {
        void add_constant_2(T2* input, T2* output, int tile_size) {
            for (int i = 0; i < tile_size; i++) {
                output[i] = input[i] + 2;
            }
        }
    }""",
    arg_types=[
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.int32,
    ],
    compilation_flags = ["-DT2=int"],
)

apply_pipeline_transform(
    input_tensor, output_tensor, add_one_function, add_two_function
)

It'll require stitching the source code strings together and probably detecting duplicate macro definitions.
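A rough sketch of that detection step (merge_defines is a hypothetical helper; flags assumed to be in the usual -DNAME or -DNAME=value form):

def merge_defines(flag_lists):
    # Merge -D flags from several ExternalFunctions, rejecting conflicting values.
    seen = {}
    for flags in flag_lists:
        for flag in flags:
            if flag.startswith("-D"):
                name, _, value = flag[2:].partition("=")
                if name in seen and seen[name] != value:
                    raise ValueError(f"conflicting definitions for macro {name}")
                seen[name] = value
    return [f"-D{n}={v}" if v else f"-D{n}" for n, v in seen.items()]

# merge_defines([["-DT1=int"], ["-DT2=int"]]) -> ["-DT1=int", "-DT2=int"]
# merge_defines([["-DT1=int"], ["-DT1=float"]]) raises ValueError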

An archive would be on the other side of the spectrum as shown in #2509 (comment)

@fifield (Collaborator) commented Aug 19, 2025

> I don't think the following is valid today, but it could be if you deferred all caching and related work until compilation: [see the per-function compilation_flags example above]

It's counterintuitive (broken?) that this doesn't already work.

> It'll require stitching the source code strings together and probably detecting duplicate macro definitions.

Why? I would expect each ExternalFunction to create an object file from its source string + flags, and that those object files might get linked together at the end depending on the requirements (e.g., they are on the same core tile). What is mlir-aie doing instead?

@ypapadop-amd (Collaborator) commented Aug 19, 2025

> > It'll require stitching the source code strings together and probably detecting duplicate macro definitions.
>
> Why? I would expect each ExternalFunction to create an object file from its source string + flags, and that those object files might get linked together at the end depending on the requirements (e.g., they are on the same core tile). What is mlir-aie doing instead?

I thought we could only link against a single .o. If that is indeed a limitation, ld --relocatable can combine .o files.
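For reference, a generic illustration of such a relocatable link, with made-up file names (shown via subprocess to keep the example in Python):

import subprocess

# Merge two relocatable objects into a single .o with ld --relocatable (ld -r).
subprocess.run(
    ["ld", "--relocatable", "add_one.o", "add_two.o", "-o", "combined.o"],
    check=True,
)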

@mawad-amd (Collaborator, Author) commented Aug 19, 2025

> I don't think the following is valid today, but it could be if you deferred all caching and related work until compilation: [see the per-function compilation_flags example above]
>
> It'll require stitching the source code strings together and probably detecting duplicate macro definitions.

This PR is not addressing this case. This looks like good motivation for what you are proposing (an actual use case would be even better). Again, you are free to implement it, but this PR addresses a different use case, per the PR description. I would prefer that we move the archive discussion to a different issue/PR; this PR is not trying to solve that problem.

(Jeff edit: somehow I hit edit instead of reply on original comment and messed it up, I think I've mostly restored it)

@fifield (Collaborator) commented Aug 19, 2025

> This PR is not addressing this case. This looks like good motivation for what you are proposing (an actual use case would be even better). Again, you are free to implement it, but this PR addresses a different use case, per the PR description. I would prefer that we move the archive discussion to a different issue/PR; this PR is not trying to solve that problem.

Existing programming example use case: combine and use

@ypapadop-amd (Collaborator)

> This PR is not addressing this case. This looks like good motivation for what you are proposing (an actual use case would be even better). Again, you are free to implement it, but this PR addresses a different use case, per the PR description. I would prefer that we move the archive discussion to a different issue/PR; this PR is not trying to solve that problem.

This PR solves a corner case of what we have been discussing, but it adds logic that could have been avoided if we solved the core issue: not having an interface to express composition of object files and creation of archives.

@jgmelber (Collaborator)

Have you considered two ExternalFunctions passed to the same Worker?

@mawad-amd (Collaborator, Author)

> Have you considered two ExternalFunctions passed to the same Worker?

Do you have an example? The example I saw (and wrote the test after) was using two workers.

@jgmelber (Collaborator) commented Aug 20, 2025

> > Have you considered two ExternalFunctions passed to the same Worker?
>
> Do you have an example? The example I saw (and wrote the test after) was using two workers.

This is the case I am thinking about:

ifeq ($(devicename),npu)
build/lut_based_ops.o: ${aie2_runtime_dir}/lut_based_ops.cpp
	mkdir -p ${@D}
	cd ${@D} && ${PEANO_INSTALL_DIR}/bin/clang++ ${PEANOWRAP2_FLAGS} -I. -c $< -o ${@F}
endif

build/%.cc.o: %.cc
	mkdir -p ${@D}
ifeq ($(devicename),npu)
	cd ${@D} && ${PEANO_INSTALL_DIR}/bin/clang++ ${PEANOWRAP2_FLAGS} -I. -I${aie2_runtime_dir} -c $< -o ${@F}
else ifeq ($(devicename),npu2)
	cd ${@D} && ${PEANO_INSTALL_DIR}/bin/clang++ ${PEANOWRAP2P_FLAGS} -c $< -o ${@F}
else
	echo "Device type not supported"
endif

ifeq ($(devicename),npu)
build/kernels.a: build/${targetname}.cc.o build/lut_based_ops.o
	ar rvs $@ $+
else ifeq ($(devicename),npu2)
build/kernels.a: build/${targetname}.cc.o
	ar rvs $@ $+
else
	echo "Device type not supported"
endif

@mawad-amd (Collaborator, Author) commented Aug 20, 2025

> Have you considered two ExternalFunctions passed to the same Worker?

Yes, this is the same suggestion Jeff made. Will check it out.

@ypapadop-amd (Collaborator) commented Aug 22, 2025

> > It'll require stitching the source code strings together and probably detecting duplicate macro definitions.
>
> Why? I would expect each ExternalFunction to create an object file from its source string + flags, and that those object files might get linked together at the end depending on the requirements (e.g., they are on the same core tile). What is mlir-aie doing instead?

@mawad-amd @fifield @jgmelber I made a janky pass of this in #2530

@jgmelber (Collaborator)

Is this redundant with #2611?

@mawad-amd (Collaborator, Author)

No, it's not. #2661 caches the kernel launch object, whereas this PR caches the compiled code object for the external kernel.

@andrej (Collaborator) commented Sep 26, 2025

I might be misunderstanding the discussion here; please correct me if I am. I also don't want to derail this PR, so if this ends up warranting more discussion, let's move it elsewhere.

That being said, I believe at a high level:

  1. We need a way to track dependencies (kernel objects, archives, ...) of designs.
  2. If one dependency exports multiple symbols, it should be only built once, not once for each symbol. (This PR.)
  3. We only want to re-build these dependencies when they change. (Caching)
  4. Ideally, we'd also like to share these dependencies between multiple designs so we don't build the same code many times.
  5. Dependencies should be defined not only by their source code but also by their compilation flags.
  6. Some dependencies might be more difficult/require more steps to build than others (e.g. archives), so it would be nice to have a way to specify arbitrary required build steps for flexibility.

This PR addresses point 2. All the other points listed above are also challenges that might need to be addressed in the future as we continue to go down this path of building our designs straight from Python.

I believe all of this has been solved with existing build systems like CMake. I understand it might feel ugly or kludgey to call into a build system from Python, but I do think the idea warrants maybe a bit more discussion.

Now given the effort already put into JIT, I understand that moving to use CMake would require considerable refactoring of the existing JIT flow, but I will venture to predict that solving all build system problems/challenges in the Python implementation will also continue to be a lot of work going forward.

I think JIT could be a great frontend to generate a list of targets that are needed by a computation graph. For example, the output of JIT at the capture_graph() stage can be a list of CMake targets. (The user doesn't have to see this; we can call CMake from within Python to then immediately build those targets, e.g., in the compile() step.) Each target would need to be uniquely named for its configuration, so perhaps that will be a challenge here. But at that point, invoking CMake would take care of most of the points I listed above (caching, dependency tracking, ...).
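A sketch of what that could look like; everything here (emit_cmake_targets, the kernel dict fields, the target naming) is hypothetical, not an existing API:

import subprocess

def emit_cmake_targets(kernels) -> str:
    # One uniquely named target per kernel configuration captured by the graph.
    lines = []
    for k in kernels:
        target = f"{k['name']}_{k['config_hash']}"
        flags = " ".join(k["compile_flags"])
        lines.append(
            f"add_custom_command(OUTPUT {target}.o\n"
            f"  COMMAND ${{KERNEL_CXX}} {flags} -c {k['source']} -o {target}.o)\n"
            f"add_custom_target({target} DEPENDS {target}.o)"
        )
    return "\n".join(lines)

def build_targets(build_dir: str, targets) -> None:
    # Configure/generate once, then build only the targets the graph needs.
    subprocess.run(["cmake", "-S", ".", "-B", build_dir], check=True)
    subprocess.run(["cmake", "--build", build_dir, "--target", *targets], check=True)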

I believe this would also allow us to cleanly bundle how to compile/use a design together with the design, rather than baking that into the JIT code. Decoupling the user (JIT) from the provider of the design, and encapsulating how to build the design with the design, would also allow other flows to more easily reuse existing operators.

Another interesting use case would be deployment. If running the JIT capture_graph() gives us a CMakeLists with targets needed for the specific graph/model that it was run on, we can build that ahead-of-time and then directly ship only the compiled code to the user (without the compiler).

Again, sorry for commenting on something a little beside the point of this PR, and my apologies if this has already been discussed. But please do let me know if anyone has thoughts on this or counterpoints on cases where the suggested approach would fail that the current approach handles more carefully. I'm probably not considering all angles here.

@ypapadop-amd (Collaborator) commented Sep 27, 2025

Calling CMake from Python is an interesting approach, but creating CMakeLists on the fly is going to be a pain. There would also be a lot of steps that can fail: CMake needs three steps (configure, generate, build), and each can fail for different reasons. If we consider different build systems, SCons or another Python-based build system may be easier to integrate.

> I believe this would also allow us to cleanly bundle how to compile/use a design together with the design, rather than baking that into the JIT code. Decoupling the user (JIT) from the provider of the design, and encapsulating how to build the design with the design, would also allow other flows to more easily reuse existing operators.

I can't envision how CMake can help with this. It's not great to begin with if you want to mix multiple languages, let alone integrate it as a JIT build system.

@andrej (Collaborator) commented Sep 29, 2025

> If we consider different build systems, SCons or another Python-based build system may be easier to integrate.

Yes, I'd be on board with considering other build systems as well. I'm not well versed in the space but just see a lot of overlap.

> I can't envision how CMake can help with this. It's not great to begin with if you want to mix multiple languages, let alone integrate it as a JIT build system.

What we've done elsewhere is create a CMake function like add_aie_gemm() that lets you specify the parameters for the operator you want to generate. That gives a clean interface between user and producer: behind the function, you can do whatever is needed to generate your targets (e.g., create archives), while exposing only the parameters to the user (i.e., the JIT user would only need to specify the required GEMM parameters). If you then need that operator elsewhere (outside of JIT), you can use the same CMake function without worrying about how to build it. My impression is that the build instructions of the operators are currently more intertwined with the JIT implementation, so extracting an operator out of the JIT flow when you need it elsewhere might be harder (?). I'll admit this is just from a superficial view of how the JIT flow is built up, so please feel free to correct me.

@ypapadop-amd (Collaborator) commented Sep 29, 2025

> > I can't envision how CMake can help with this. [...]
>
> What we've done elsewhere is create a CMake function like add_aie_gemm() that lets you specify the parameters for the operator you want to generate. [...]

I've done this as well, and it was a maintenance burden. There's no way to express a contract (API, parameter names, etc.) when you cross the boundary between a language and CMake. E.g., pybind11 lets you create and name parameters in C++ and pass them to a Python function; there's no such capability for CMake integration. That would need to be added and may not be trivial.

Using CMake as the JIT manager also creates a latency issue if it's the source of truth for deciding whether a JIT'd function is up to date.

@ypapadop-amd (Collaborator)

Bumping this; do we have any way forward with this? I was trying to use ExternalFunction in the matmul example and it's not a great experience:

  1. ExternalFunction objects need to be explicitly resolved for placed designs,
  2. I'm relying on a really bad approach in my code to be able to support two functions in the same archive.
# Excerpt; assumes the surrounding module provides dtype_to_str, mat_mul_fn,
# dev, A, B, C, ExternalFunction, mlir_mod_ctx, and the usual imports
# (from os import path; import numpy as np).
def create_mat_mul_external_functions(
    arch: str,
    input_tensors: list,
    output_tensor,
):
    m = 8
    n = 8
    k = 8
    use_scalar = False
    scalar_suffix = "_scalar" if use_scalar else ""

    num_cols = None
    if arch == "aie2":
        num_cols = 4
    elif arch == "aie2p":
        num_cols = 8
    else:
        raise ValueError(f"Unsupported architecture: {arch}")

    current_dir = path.dirname(path.realpath(__file__))
    source_file = path.join(current_dir, arch, "mm.cc")
    compile_args = [
        f"-DDIM_M={m}",
        f"-DDIM_N={n}",
        f"-DDIM_K={k}",
        f"-D{dtype_to_str(input_tensors[0].dtype)}_{dtype_to_str(output_tensor.dtype)}_ONLY",
        "-DB_COL_MAJ",
        "-DC_COL_MAJ",
    ]
    object_file_name = "matmul_core_functions.o"

    zero_fn = ExternalFunction(
        name=f"zero{scalar_suffix}_{dtype_to_str(output_tensor.dtype)}",
        object_file_name=object_file_name,
        source_file=source_file,
        arg_types=[np.ndarray[(m, n), np.dtype[output_tensor.dtype]]],
        compile_flags=compile_args,
    )

    matmul_fn = ExternalFunction(
        name=f"matmul{scalar_suffix}_{dtype_to_str(input_tensors[0].dtype)}_{dtype_to_str(output_tensor.dtype)}",
        object_file_name=object_file_name,
        source_file=source_file,
        arg_types=[
            np.ndarray[(m, k), np.dtype[input_tensors[0].dtype]],
            np.ndarray[(k, n), np.dtype[input_tensors[0].dtype]],
            np.ndarray[(m, n), np.dtype[output_tensor.dtype]],
        ],
        compile_flags=compile_args,
    )

    return (
        m,
        n,
        k,
        use_scalar,
        num_cols,
        zero_fn,
        matmul_fn,
    )

def my_matmul(arch: str, input_tensors: list, output_tensor):
    m, n, k, use_scalar, num_cols, zero_fn, matmul_fn = (
        create_mat_mul_external_functions(
            arch=arch, input_tensors=input_tensors, output_tensor=output_tensor
        )
    )

    with mlir_mod_ctx() as ctx:
        mat_mul_fn(
            dev=dev,
            M=A.shape[1],
            N=B.shape[1],
            K=A.shape[0],
            m=m,
            n=n,
            k=k,
            n_aie_cols=num_cols,
            dtype_in_str=dtype_to_str(A.dtype),
            dtype_out_str=dtype_to_str(C.dtype),
            b_col_maj=True,
            c_col_maj=True,
            use_scalar=use_scalar,
            emulate_bf16_mmul_with_bfp16=False,
            trace_size=0,
            zero_fn=zero_fn._name,
            matmul_fn=matmul_fn._name,
            object_file=matmul_fn.bin_name,
        )
        return ctx.module

@jgmelber (Collaborator) commented Nov 5, 2025

Bumping this as well. What's the latest status on this? @mawad-amd @andrej @ypapadop-amd

@hunhoffe hunhoffe added this to the JIT/Runtime/Compilation Refactor milestone Nov 10, 2025
@hunhoffe hunhoffe removed this from the JIT/Runtime/Compilation Refactor milestone Nov 19, 2025