Conversation

@mawad-amd (Collaborator) commented Aug 17, 2025

Today if we write:

source_code = """extern "C" {
        void add_constant_1(int* input, int* output, int tile_size) {
            for (int i = 0; i < tile_size; i++) {
                output[i] = input[i] + 1;
            }
        }
        void add_constant_2(int* input, int* output, int tile_size) {
            for (int i = 0; i < tile_size; i++) {
                output[i] = input[i] + 2;
            }
        }
    }"""
# Create ExternalFunction for adding one (first stage)
tile_size = 16
add_one_function = ExternalFunction(
    "add_constant_1",
    source_string=source_code,
    arg_types=[
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.int32,
    ],
)
# Create ExternalFunction for adding two (second stage)
add_two_function = ExternalFunction(
    "add_constant_2",
    source_string=source_code,
    arg_types=[
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.int32,
    ],
)
# Apply the transform: input -> add_constant_1 -> add_constant_2 -> output
# This will apply: input + 1 + 2 = input + 3
apply_pipeline_transform(
    input_tensor, output_tensor, add_one_function, add_two_function
)

The two ExternalFunctions will be associated with two different code objects, which doesn't make sense: the source code is the same and the compiler flags are the same, so we should be smart enough to detect that a single code object suffices. This PR adds a cache so we can support multiple kernels from the same code object without much overhead or extra lines of code.
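A minimal sketch of the caching idea, with hypothetical names (compile_once and _cache_key are illustrative, not this PR's actual code): identical source, flags, and include directories hash to the same key, so only one code object gets built.

import hashlib

_code_object_cache: dict = {}  # key -> code object name (illustrative)

def _cache_key(source: str, compile_flags=(), include_dirs=()) -> str:
    h = hashlib.sha256()
    for part in (source, *compile_flags, *include_dirs):
        h.update(part.encode())
    return h.hexdigest()

def compile_once(source: str, compile_flags=(), include_dirs=()) -> str:
    key = _cache_key(source, compile_flags, include_dirs)
    if key not in _code_object_cache:
        # The real flow would invoke the compiler here; the key doubles as the
        # code object name, so identical definitions collide on purpose.
        _code_object_cache[key] = f"{key}.o"
    return _code_object_cache[key]

With this, add_constant_1 and add_constant_2 above resolve to the same cached object file.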

@ypapadop-amd (Collaborator)

Instead of having it as a cache, can we have an explicit object that manages an archive? E.g.,

archive = ExternalArchive(
    file_name="my_archive.o",
    source_string=source_code,
    compile_flags=[
        ...
    ],
    include_dirs = [
        ...
    ]
)

# Create ExternalFunction for adding one (first stage)
tile_size = 16
add_one_function = ExternalFunction(
    "add_constant_1",
    archive=archive,
    arg_types=[
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.int32,
    ],
)
# Create ExternalFunction for adding two (second stage)
add_two_function = ExternalFunction(
    "add_constant_2",
    archive=archive,
    arg_types=[
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.int32,
    ],
)
# Apply the transform: input -> add_constant_1 -> add_constant_2 -> output
# This will apply: input + 1 + 2 = input + 3
apply_pipeline_transform(
    input_tensor, output_tensor, add_one_function, add_two_function
)

For non-trivial examples, we'll have include directories and multiple flags, and duplicating them could lead to unexpected behavior if they fall out of sync.

@mawad-amd (Collaborator, Author)

The include directories and compiler flags are part of the hash that gets computed, which ends up being the code object name. If everything doesn't match exactly, you will get two different code objects, and the Worker class will complain that you are using two code objects. This behavior is well defined here.

@ypapadop-amd (Collaborator)

It's still a less friendly developer experience to have to duplicate flags and include directories for something that should be managed centrally.

It also seems that it forces you to put a file in the kernel directory, which is not ideal for projects that put artifacts in different directories than the source code.

@mawad-amd (Collaborator, Author) commented Aug 18, 2025

> It's still a less friendly developer experience to have to duplicate flags and include directories for something that should be managed centrally.

Sure, this PR just enables current code to work when it should. Eventually we can introduce an "ExternalModule" or "ExternalCodeObject". This PR is not doing this.

> It also seems that it forces you to put a file in the kernel directory, which is not ideal for projects that put artifacts in different directories than the source code.

What do you mean? User source code can come from a string or a file. See tests.

@mawad-amd (Collaborator, Author)

> Instead of having it as a cache, can we have an explicit object that manages an archive? E.g., [see the ExternalArchive example above]
>
> For non-trivial examples, we'll have include directories and multiple flags, and duplicating them could lead to unexpected behavior if they fall out of sync.

Yeah, this is a different PR. Feel free to code it up.

@ypapadop-amd (Collaborator)

> It also seems that it forces you to put a file in the kernel directory, which is not ideal for projects that put artifacts in different directories than the source code.
>
> What do you mean? User source code can come from a string or a file. See tests.

If I understand it correctly, it writes something (code?) into the directory that the kernel is in. Isn't that correct?

@ypapadop-amd (Collaborator) commented Aug 18, 2025

> Yeah, this is a different PR. Feel free to code it up.

Is there a way to disable this support? If not, it will break my workflow and it's not acceptable to break users' workflows.

@mawad-amd (Collaborator, Author)

Old behavior should work fine: if you specify the code object name, you override everything. Please test it and let me know if it breaks your code.

@mawad-amd (Collaborator, Author)

@ypapadop-amd, any update?

@mawad-amd mawad-amd requested a review from jgmelber August 19, 2025 00:30
@ypapadop-amd (Collaborator) commented Aug 19, 2025

> @ypapadop-amd, any update?

It appears that initializing an ExternalFunction expects to find the source file relative to where the initialization happens. We will also pay the cost of reading the source file, even when the key is not useful to a non-JIT flow.

Can we put the cache inside the compile function instead of it being a singleton of the class? I don't see a particular technical reason for it to be there.

@mawad-amd (Collaborator, Author)

> It appears that initializing an ExternalFunction expects to find the source file relative to where the initialization happens. We will also pay the cost of reading the source file, even when the key is not useful to a non-JIT flow.

Ok, I can fix that. Can you share a test for your use case? We can make sure it always works.

> Can we put the cache inside the compile function instead of it being a singleton of the class? I don't see a particular technical reason for it to be there.

We would need a bunch of if not hasattr(ExternalFunction, "_cache") checks. It doesn't look cleaner to me.
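A minimal sketch of the two placements being debated, with hypothetical names:

import functools

# (1) Roughly what this PR does: a cache stored on the class itself.
class ExternalFunction:
    _cache: dict = {}  # shared by all instances, filled in during compilation

# (2) The suggested alternative: keep the cache local to the compilation step,
#     e.g. by memoizing a key-to-object-file function.
@functools.lru_cache(maxsize=None)
def compiled_object_for(key: str) -> str:
    return f"{key}.o"  # the real flow would compile here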

@mawad-amd (Collaborator, Author)

@ypapadop-amd Hopefully d89d954 makes things work now. I would like a test (or a description of one) whenever you have some time.

@ypapadop-amd (Collaborator)

> Ok, I can fix that. Can you share a test for your use case? We can make sure it always works.

It may be difficult to write a test case outside of what I'm doing since it's a different build process. I can try but it's extra churn. This is why I've been suggesting a version of iron.jit that returns artifacts instead of running them; we'd avoid this extra back and forth.

> > Can we put the cache inside the compile function instead of it being a singleton of the class? I don't see a particular technical reason for it to be there.
>
> We would need a bunch of if not hasattr(ExternalFunction, "_cache") checks. It doesn't look cleaner to me.

Does this attribute need to be a member of ExternalFunction? It seems to be applicable only to the compilation process.

I also don't understand why we'd need a number of them: there's a single loop that does the compilation/linking, and once that's done you don't need to revisit it. A function that computes the key on demand and then caches it would also work, as would the original suggestion of an archive class, which would mirror what traditional toolchains do today.

In any case, since I'm using explicit object_file_names it's fine for my use case, but I can see how it can break outside iron.jit.

@mawad-amd (Collaborator, Author)

You are welcome to change the JIT backend to support managing the code objects. As long as the abstraction is the same, that’s fine by me.

The behavior that this PR enables should just work. It doesn't make sense to have the same code and the same compiler flags and end up with two different code objects for JIT. The archive idea is fine but doesn't solve this problem.

@ypapadop-amd (Collaborator)

> You are welcome to change the JIT backend to support managing the code objects. As long as the abstraction is the same, that's fine by me.
>
> The behavior that this PR enables should just work. It doesn't make sense to have the same code and the same compiler flags and end up with two different code objects for JIT. The archive idea is fine but doesn't solve this problem.

I never suggested that you should have different code objects, which is not supported anyway.

I will debate programmability, though. This PR requires duplicating compile options at the creation of each ExternalFunction, even when they don't apply to that particular compiled function, while still requiring a centralized function definition. I cannot think of another system with similar behavior.

I don't think the following is valid today, but it could be if you deferred all caching and related work until compilation:

tile_size = 16
add_one_function = ExternalFunction(
    "add_constant_1",
    source_string="""extern "C" {
        void add_constant_1(T1* input, T1* output, int tile_size) {
            for (int i = 0; i < tile_size; i++) {
                output[i] = input[i] + 1;
            }
        }""",
    arg_types=[
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.int32,
    ],
    compilation_flags = ["-DT1=int"],
)

add_two_function = ExternalFunction(
    "add_constant_2",
    source_string="""extern "C" {
        void add_constant_2(T2* input, T2* output, int tile_size) {
            for (int i = 0; i < tile_size; i++) {
                output[i] = input[i] + 2;
            }
        }
    }""",
    arg_types=[
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.int32,
    ],
    compilation_flags = ["-DT2=int"],
)

apply_pipeline_transform(
    input_tensor, output_tensor, add_one_function, add_two_function
)

It'll require stitching the source code strings together and probably detecting duplicate macro definitions.
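A rough sketch of that detection step (merge_defines is a hypothetical helper; flags assumed to be in the usual -DNAME or -DNAME=value form):

def merge_defines(flag_lists):
    # Merge -D flags from several ExternalFunctions, rejecting conflicting values.
    seen = {}
    for flags in flag_lists:
        for flag in flags:
            if flag.startswith("-D"):
                name, _, value = flag[2:].partition("=")
                if name in seen and seen[name] != value:
                    raise ValueError(f"conflicting definitions for macro {name}")
                seen[name] = value
    return [f"-D{n}={v}" if v else f"-D{n}" for n, v in seen.items()]

# merge_defines([["-DT1=int"], ["-DT2=int"]]) -> ["-DT1=int", "-DT2=int"]
# merge_defines([["-DT1=int"], ["-DT1=float"]]) raises ValueError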

An archive would be on the other side of the spectrum as shown in #2509 (comment)

@fifield (Collaborator) commented Aug 19, 2025

> I don't think the following is valid today, but it could be if you deferred all caching and related work until compilation: [see the per-function compilation_flags example above]

It's counterintuitive (broken?) that this doesn't already work.

> It'll require stitching the source code strings together and probably detecting duplicate macro definitions.

Why? I would expect each ExternalFunction to create an object file from its source string + flags, and that those object files might get linked together at the end depending on the requirements (e.g., they are on the same core tile). What is mlir-aie doing instead?

@ypapadop-amd (Collaborator) commented Aug 19, 2025

> > It'll require stitching the source code strings together and probably detecting duplicate macro definitions.
>
> Why? I would expect each ExternalFunction to create an object file from its source string + flags, and that those object files might get linked together at the end depending on the requirements (e.g., they are on the same core tile). What is mlir-aie doing instead?

I thought we could only link against a single .o. If that is indeed a limitation, ld --relocatable can combine .o files.
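For reference, a generic illustration of such a relocatable link, with made-up file names (shown via subprocess to keep the example in Python):

import subprocess

# Merge two relocatable objects into a single .o with ld --relocatable (ld -r).
subprocess.run(
    ["ld", "--relocatable", "add_one.o", "add_two.o", "-o", "combined.o"],
    check=True,
)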

@mawad-amd (Collaborator, Author) commented Aug 19, 2025

> I don't think the following is valid today, but it could be if you deferred all caching and related work until compilation: [see the per-function compilation_flags example above]
>
> It'll require stitching the source code strings together and probably detecting duplicate macro definitions.

This PR is not addressing this case. This looks like good motivation for what you are proposing (an actual use case would be even better). Again, you are free to implement it, but this PR addresses a different use case, per the PR description. I would prefer that we move the archive discussion to a different issue/PR; this PR is not trying to solve that problem.

(Jeff edit: somehow I hit edit instead of reply on original comment and messed it up, I think I've mostly restored it)

@fifield (Collaborator) commented Aug 19, 2025

> This PR is not addressing this case. This looks like good motivation for what you are proposing (an actual use case would be even better). Again, you are free to implement it, but this PR addresses a different use case, per the PR description. I would prefer that we move the archive discussion to a different issue/PR; this PR is not trying to solve that problem.

Existing programming example use case: combine and use

@ypapadop-amd (Collaborator)

> This PR is not addressing this case. This looks like good motivation for what you are proposing (an actual use case would be even better). Again, you are free to implement it, but this PR addresses a different use case, per the PR description. I would prefer that we move the archive discussion to a different issue/PR; this PR is not trying to solve that problem.

This PR solves a corner case of what we have been discussing, but it adds logic that could have been avoided if we solved the core issue: not having an interface to express composition of object files and creation of archives.

@jgmelber (Collaborator)

Have you considered two ExternalFunctions passed to the same Worker?

@mawad-amd (Collaborator, Author)

> Have you considered two ExternalFunctions passed to the same Worker?

Do you have an example? The example I saw (and wrote the test after) was using two workers.

@jgmelber (Collaborator) commented Aug 20, 2025

> > Have you considered two ExternalFunctions passed to the same Worker?
>
> Do you have an example? The example I saw (and wrote the test after) was using two workers.

This is the case I am thinking about:

ifeq ($(devicename),npu)
build/lut_based_ops.o: ${aie2_runtime_dir}/lut_based_ops.cpp
	mkdir -p ${@D}
	cd ${@D} && ${PEANO_INSTALL_DIR}/bin/clang++ ${PEANOWRAP2_FLAGS} -I. -c $< -o ${@F}
endif

build/%.cc.o: %.cc
	mkdir -p ${@D}
ifeq ($(devicename),npu)
	cd ${@D} && ${PEANO_INSTALL_DIR}/bin/clang++ ${PEANOWRAP2_FLAGS} -I. -I${aie2_runtime_dir} -c $< -o ${@F}
else ifeq ($(devicename),npu2)
	cd ${@D} && ${PEANO_INSTALL_DIR}/bin/clang++ ${PEANOWRAP2P_FLAGS} -c $< -o ${@F}
else
	echo "Device type not supported"
endif

ifeq ($(devicename),npu)
build/kernels.a: build/${targetname}.cc.o build/lut_based_ops.o
	ar rvs $@ $+
else ifeq ($(devicename),npu2)
build/kernels.a: build/${targetname}.cc.o
	ar rvs $@ $+
else
	echo "Device type not supported"
endif

@mawad-amd (Collaborator, Author) commented Aug 20, 2025

> Have you considered two ExternalFunctions passed to the same Worker?

Yes, this is the same suggestion Jeff made. Will check it out.

@ypapadop-amd (Collaborator) commented Aug 22, 2025

> > It'll require stitching the source code strings together and probably detecting duplicate macro definitions.
>
> Why? I would expect each ExternalFunction to create an object file from its source string + flags, and that those object files might get linked together at the end depending on the requirements (e.g., they are on the same core tile). What is mlir-aie doing instead?

@mawad-amd @fifield @jgmelber I made a janky pass of this in #2530

@jgmelber (Collaborator)

Is this redundant with #2611?

@mawad-amd (Collaborator, Author)

No, it's not. #2661 caches the kernel launch object, whereas this PR caches the compiled code object for the external kernel.

@andrej (Collaborator) commented Sep 26, 2025

I might be misunderstanding the discussion here; please correct me if I am. I also don't want to derail this PR, so if this ends up warranting more discussion, let's move it elsewhere.

That being said, I believe at a high level:

  1. We need a way to track dependencies (kernel objects, archives, ...) of designs.
  2. If one dependency exports multiple symbols, it should be only built once, not once for each symbol. (This PR.)
  3. We only want to re-build these dependencies when they change. (Caching)
  4. Ideally, we'd also like to share these dependencies between multiple designs so we don't build the same code many times.
  5. Dependencies should be defined not only by their source code but also by their compilation flags.
  6. Some dependencies might be more difficult/require more steps to build than others (e.g. archives), so it would be nice to have a way to specify arbitrary required build steps for flexibility.

This PR addresses point 2. All the other points listed above are also challenges that might need to be addressed in the future as we continue to go down this path of building our designs straight from Python.

I believe all of this has been solved with existing build systems like CMake. I understand it might feel ugly or kludgey to call into a build system from Python, but I do think the idea warrants maybe a bit more discussion.

Now given the effort already put into JIT, I understand that moving to use CMake would require considerable refactoring of the existing JIT flow, but I will venture to predict that solving all build system problems/challenges in the Python implementation will also continue to be a lot of work going forward.

I think JIT could be a great frontend to generate a list of targets that are needed by a computation graph. For example, the output of JIT at the capture_graph() stage can be a list of CMake targets. (The user doesn't have to see this; we can call CMake from within Python to then immediately build those targets, e.g., in the compile() step.) Each target would need to be uniquely named for its configuration, so perhaps that will be a challenge here. But at that point, invoking CMake would take care of most of the points I listed above (caching, dependency tracking, ...).
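A sketch of what that could look like; everything here (emit_cmake_targets, the kernel dict fields, the target naming) is hypothetical, not an existing API:

import subprocess

def emit_cmake_targets(kernels) -> str:
    # One uniquely named target per kernel configuration captured by the graph.
    lines = []
    for k in kernels:
        target = f"{k['name']}_{k['config_hash']}"
        flags = " ".join(k["compile_flags"])
        lines.append(
            f"add_custom_command(OUTPUT {target}.o\n"
            f"  COMMAND ${{KERNEL_CXX}} {flags} -c {k['source']} -o {target}.o)\n"
            f"add_custom_target({target} DEPENDS {target}.o)"
        )
    return "\n".join(lines)

def build_targets(build_dir: str, targets) -> None:
    # Configure/generate once, then build only the targets the graph needs.
    subprocess.run(["cmake", "-S", ".", "-B", build_dir], check=True)
    subprocess.run(["cmake", "--build", build_dir, "--target", *targets], check=True)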

I believe this would also allow us to cleanly bundle how to compile/use a design together with the design, rather than baking that into the JIT code. Decoupling the user (JIT) from the provider of the design, and encapsulating how to build the design with the design, would also allow other flows to more easily reuse existing operators.

Another interesting use case would be deployment. If running the JIT capture_graph() gives us a CMakeLists with targets needed for the specific graph/model that it was run on, we can build that ahead-of-time and then directly ship only the compiled code to the user (without the compiler).

Again, sorry for commenting on something a little beside the point of this PR, and my apologies if this has already been discussed. But please do let me know if anyone has thoughts on this or counterpoints on cases where the suggested approach would fail that the current approach handles more carefully. I'm probably not considering all angles here.

@ypapadop-amd (Collaborator) commented Sep 27, 2025

Calling CMake from Python is an interesting approach, but creating CMakeLists on the fly is going to be a pain. There would also be a lot of steps that can fail: CMake needs three steps (configure, generate, build), and each can fail for different reasons. If we consider different build systems, SCons or another Python-based build system may be easier to integrate.

> I believe this would also allow us to cleanly bundle how to compile/use a design together with the design, rather than baking that into the JIT code. Decoupling the user (JIT) from the provider of the design, and encapsulating how to build the design with the design, would also allow other flows to more easily reuse existing operators.

I can't envision how CMake can help with this. It's not great to begin with if you want to mix multiple languages, let alone integrate it as a JIT build system.

@andrej (Collaborator) commented Sep 29, 2025

> If we consider different build systems, SCons or another Python-based build system may be easier to integrate.

Yes, I'd be on board with considering other build systems as well. I'm not well versed in the space but just see a lot of overlap.

> I can't envision how CMake can help with this. It's not great to begin with if you want to mix multiple languages, let alone integrate it as a JIT build system.

What we've done elsewhere is create a CMake function like add_aie_gemm() that lets you specify the parameters for the operator you want to generate. That gives a clean interface between user and producer: behind the function, you can do whatever is needed to generate your targets (e.g., create archives), while exposing only the parameters to the user (i.e., the JIT user would only need to specify the required GEMM parameters). If you then need that operator elsewhere (outside of JIT), you can use the same CMake function without worrying about how to build it. My impression is that the build instructions of the operators are currently more intertwined with the JIT implementation, so extracting an operator out of the JIT flow when you need it elsewhere might be harder (?). I'll admit this is just from a superficial view of how the JIT flow is built up, so please feel free to correct me.

@ypapadop-amd (Collaborator) commented Sep 29, 2025

> > I can't envision how CMake can help with this. [...]
>
> What we've done elsewhere is create a CMake function like add_aie_gemm() that lets you specify the parameters for the operator you want to generate. [...]

I've done this as well, and it was a maintenance burden. There's no way to express a contract (API, parameter names, etc.) when you cross the boundary between a language and CMake. E.g., pybind11 lets you create and name parameters in C++ and pass them to a Python function; there's no such capability for CMake integration. That would need to be added and may not be trivial.

Using CMake as the JIT manager also creates a latency issue if it's the source of truth for deciding whether a JIT'd function is up to date.

@ypapadop-amd (Collaborator)

Bumping this; do we have any way forward with this? I was trying to use ExternalFunction in the matmul example and it's not a great experience:

  1. ExternalFunction objects need to be explicitly resolved for placed designs,
  2. I'm relying on a really bad approach in my code to be able to support two functions in the same archive.
# Excerpt; assumes the surrounding module provides dtype_to_str, mat_mul_fn,
# dev, A, B, C, ExternalFunction, mlir_mod_ctx, and the usual imports
# (from os import path; import numpy as np).
def create_mat_mul_external_functions(
    arch: str,
    input_tensors: list,
    output_tensor,
):
    m = 8
    n = 8
    k = 8
    use_scalar = False
    scalar_suffix = "_scalar" if use_scalar else ""

    num_cols = None
    if arch == "aie2":
        num_cols = 4
    elif arch == "aie2p":
        num_cols = 8
    else:
        raise ValueError(f"Unsupported architecture: {arch}")

    current_dir = path.dirname(path.realpath(__file__))
    source_file = path.join(current_dir, arch, "mm.cc")
    compile_args = [
        f"-DDIM_M={m}",
        f"-DDIM_N={n}",
        f"-DDIM_K={k}",
        f"-D{dtype_to_str(input_tensors[0].dtype)}_{dtype_to_str(output_tensor.dtype)}_ONLY",
        "-DB_COL_MAJ",
        "-DC_COL_MAJ",
    ]
    object_file_name = "matmul_core_functions.o"

    zero_fn = ExternalFunction(
        name=f"zero{scalar_suffix}_{dtype_to_str(output_tensor.dtype)}",
        object_file_name=object_file_name,
        source_file=source_file,
        arg_types=[np.ndarray[(m, n), np.dtype[output_tensor.dtype]]],
        compile_flags=compile_args,
    )

    matmul_fn = ExternalFunction(
        name=f"matmul{scalar_suffix}_{dtype_to_str(input_tensors[0].dtype)}_{dtype_to_str(output_tensor.dtype)}",
        object_file_name=object_file_name,
        source_file=source_file,
        arg_types=[
            np.ndarray[(m, k), np.dtype[input_tensors[0].dtype]],
            np.ndarray[(k, n), np.dtype[input_tensors[0].dtype]],
            np.ndarray[(m, n), np.dtype[output_tensor.dtype]],
        ],
        compile_flags=compile_args,
    )

    return (
        m,
        n,
        k,
        use_scalar,
        num_cols,
        zero_fn,
        matmul_fn,
    )

def my_matmul(arch: str, input_tensors: list, output_tensor):
    m, n, k, use_scalar, num_cols, zero_fn, matmul_fn = (
        create_mat_mul_external_functions(
            arch=arch, input_tensors=input_tensors, output_tensor=output_tensor
        )
    )

    with mlir_mod_ctx() as ctx:
        mat_mul_fn(
            dev=dev,
            M=A.shape[1],
            N=B.shape[1],
            K=A.shape[0],
            m=m,
            n=n,
            k=k,
            n_aie_cols=num_cols,
            dtype_in_str=dtype_to_str(A.dtype),
            dtype_out_str=dtype_to_str(C.dtype),
            b_col_maj=True,
            c_col_maj=True,
            use_scalar=use_scalar,
            emulate_bf16_mmul_with_bfp16=False,
            trace_size=0,
            zero_fn=zero_fn._name,
            matmul_fn=matmul_fn._name,
            object_file=matmul_fn.bin_name,
        )
        return ctx.module

@jgmelber (Collaborator) commented Nov 5, 2025

Bumping this as well. What's the latest status on this? @mawad-amd @andrej @ypapadop-amd

@hunhoffe hunhoffe added this to the JIT/Runtime/Compilation Refactor milestone Nov 10, 2025
@hunhoffe hunhoffe removed this from the JIT/Runtime/Compilation Refactor milestone Nov 19, 2025