Add a code object cache for external functions #2509
Conversation
Instead of having this as a cache, can we have an explicit object that manages an archive? E.g.:

```python
archive = ExternalArchive(
    file_name="my_archive.o",
    source_string=source_code,
    compile_flags=[
        ...
    ],
    include_dirs=[
        ...
    ],
)

# Create ExternalFunction for adding one (first stage)
tile_size = 16
add_one_function = ExternalFunction(
    "add_constant_1",
    archive=archive,
    arg_types=[
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.int32,
    ],
)

# Create ExternalFunction for adding two (second stage)
add_two_function = ExternalFunction(
    "add_constant_2",
    archive=archive,
    arg_types=[
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.int32,
    ],
)

# Apply the transform: input -> add_constant_1 -> add_constant_2 -> output
# This will apply: input + 1 + 2 = input + 3
apply_pipeline_transform(
    input_tensor, output_tensor, add_one_function, add_two_function
)
```

For non-trivial examples, we'll have include directories and multiple flags that could lead to unexpected behavior if we duplicate them and they fall out of sync.
The include directories and compiler flags are part of the hash that gets computed, which ends up being the code object name. If everything does not match exactly, you will get two different code objects and the worker class will complain that you are using two code objects. This behavior is well defined here.
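For illustration, here is a minimal sketch (a hypothetical helper, not the actual implementation) of how a hash-derived code object name and an explicit-name override could fit together:

```python
import hashlib

def code_object_name(source, compile_flags, include_dirs, explicit_name=None):
    """Sketch: derive the code object name from everything that affects
    compilation, so identical inputs map to a single object."""
    if explicit_name is not None:
        return explicit_name  # explicit name overrides the hash
    h = hashlib.sha256(source.encode())
    # Order is preserved, since flag order can matter to the compiler.
    for item in [*compile_flags, *include_dirs]:
        h.update(item.encode())
    return f"{h.hexdigest()[:16]}.o"
```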
It's still a less friendly developer experience to have to duplicate flags and include directories for something that should be managed centrally. It also seems to force you to put a file in the kernel directory, which is not ideal for projects that put artifacts in different directories than the source code.
Sure, this PR just enables current code to work when it should. Eventually we can introduce an "ExternalModule" or "ExternalCodeObject"; this PR is not doing that.
What do you mean? User source code can come from a string or a file. See the tests.
Yeah, this is a different PR. Feel free to code it up.
If I understand it correctly, it writes something (code?) into the directory that the kernel is in. Isn't that correct?
Is there a way to disable this support? If not, it will break my workflow, and it's not acceptable to break users' workflows.
The old behavior should work fine: if you specify the code object name, you override everything. Please test it and let me know if it breaks your code.
@ypapadop-amd, any update?
It appears that initializing an […]. Can we put the cache inside the compile function instead of it being a singleton of the class? I don't see a particular technical reason for it to be there.
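A minimal sketch of what that suggestion could look like (hypothetical names; the point is that the cache becomes an implementation detail of the compile path rather than class state):

```python
# Hypothetical sketch: the cache lives next to the compile step instead of
# being a singleton on the ExternalFunction class, so constructing an
# ExternalFunction has no side effects.
_code_object_cache = {}  # cache key -> path of the compiled object file

def compile_once(key, do_compile):
    """Run do_compile() at most once per key and reuse the result."""
    if key not in _code_object_cache:
        _code_object_cache[key] = do_compile()
    return _code_object_cache[key]
```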
Ok, I can fix that. Can you share a test for your use case? We can make sure it always works.
We would need a bunch of […]
@ypapadop-amd Hopefully d89d954 makes things work now. I would like a test/description of one whenever you have some time.
It may be difficult to write a test case outside of what I'm doing since it's a different build process. I can try, but it's extra churn. This is why I've been suggesting a version of […]
Does this attribute need to be a member of […]? I also don't understand why we'd need a number of them, since there's a single loop that does the compilation/linking, and once you've done that you don't need to revisit it. A function that calculates the key on demand and then caches it would also do this, or even the original suggestion of an archive class, which would mirror what traditional toolchains do today. In any case, since I'm using explicit […]
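For instance, a lazily computed, memoized key could look like this (a sketch with hypothetical names, assuming the key depends only on the source, flags, and include directories):

```python
import hashlib
from functools import cached_property

class ExternalFunctionSketch:
    """Hypothetical stand-in illustrating a key computed on demand."""

    def __init__(self, source, compile_flags, include_dirs):
        self.source = source
        self.compile_flags = compile_flags
        self.include_dirs = include_dirs

    @cached_property
    def cache_key(self):
        # Computed on first access and memoized; no eager per-instance
        # or class-level bookkeeping needed.
        h = hashlib.sha256(self.source.encode())
        for item in [*self.compile_flags, *self.include_dirs]:
            h.update(item.encode())
        return h.hexdigest()
```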
You are welcome to change the JIT backend to support managing the code objects. As long as the abstraction is the same, that's fine by me. The behavior that this PR enables should just work: it doesn't make sense to have the same code and the same compiler flags and end up with two different code objects for JIT. The archive idea is fine but doesn't solve the problem.
I never suggested that you should have different code objects, which is not supported anyway. I will debate programmability though. This PR requires duplication of compile options at the creation of […]

I don't think the following is valid, but it could be if you'd defer all caching and related work until compilation:

```python
tile_size = 16
add_one_function = ExternalFunction(
    "add_constant_1",
    source_string="""extern "C" {
    void add_constant_1(T1* input, T1* output, int tile_size) {
        for (int i = 0; i < tile_size; i++) {
            output[i] = input[i] + 1;
        }
    }
    }""",
    arg_types=[
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.int32,
    ],
    compilation_flags=["-DT1=int"],
)
add_two_function = ExternalFunction(
    "add_constant_2",
    source_string="""extern "C" {
    void add_constant_2(T2* input, T2* output, int tile_size) {
        for (int i = 0; i < tile_size; i++) {
            output[i] = input[i] + 2;
        }
    }
    }""",
    arg_types=[
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.ndarray[(tile_size,), np.dtype[np.int32]],
        np.int32,
    ],
    compilation_flags=["-DT2=int"],
)
apply_pipeline_transform(
    input_tensor, output_tensor, add_one_function, add_two_function
)
```

It'll require stitching the source code strings together and probably detecting duplicate macro definitions. An archive would be on the other side of the spectrum, as shown in #2509 (comment).
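For what that stitching could entail, a rough sketch (hypothetical helper and attribute names):

```python
def stitch_sources(functions):
    """Hypothetical helper: merge per-function source strings and compile
    flags into one translation unit, rejecting conflicting -D macros."""
    defines, flags = {}, []
    for fn in functions:
        for flag in fn.compilation_flags:
            if flag.startswith("-D"):
                name, _, value = flag[2:].partition("=")
                if defines.get(name, value) != value:
                    raise ValueError(f"conflicting definitions of macro {name}")
                defines[name] = value
            if flag not in flags:
                flags.append(flag)
    merged_source = "\n".join(fn.source_string for fn in functions)
    return merged_source, flags
```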
It's counterintuitive (broken?) that this doesn't already work.
Why? I would expect each […]
I thought we could only link against a single […]
This PR is not addressing this case. This looks like a good motivation for what you are proposing (an actual use case would be even better). Again, you are free to implement it, but this PR addresses a different use case per the PR description. I would prefer if we move the archive discussion to a different issue/PR; this PR is not trying to solve that problem. (Jeff edit: somehow I hit edit instead of reply on the original comment and messed it up. I think I've mostly restored it.)
This PR solves a corner case of what we have been discussing, but it adds logic that could have been avoided if we solved the core issue of not having an interface to express composition of object files and creation of archives.
Have you considered two […]?
Do you have an example? The example I saw (and wrote the test after) was using two workers.
This is the case I am thinking about: mlir-aie/programming_examples/ml/silu/Makefile, lines 33 to 57 at 19a0a20.
Yes, this is the same suggestion Jeff made. Will check it out.
@mawad-amd @fifield @jgmelber I made a janky pass of this in #2530
Is this redundant to #2611?
No, it's not. #2611 caches the kernel launch object, whereas this PR caches the compiled code object for external kernels.
I might be misunderstanding the discussion here, please correct me if I am. I also don't want to derail this PR, so if this ends up warranting more discussion, let's move it elsewhere. That being said, I believe at a high level: […]

This PR addresses point 2. All the other points listed above are also challenges that might need to be addressed in the future as we continue down this path of building our designs straight from Python. I believe all of this has been solved with existing build systems like CMake. I understand it might feel ugly or kludgey to call into a build system from Python, but I do think the idea warrants a bit more discussion. Given the effort already put into JIT, I understand that moving to CMake would require considerable refactoring of the existing JIT flow, but I will venture to predict that solving all build-system problems/challenges in the Python implementation will also continue to be a lot of work going forward.

I think JIT could be a great frontend to generate a list of targets that are needed by a computation graph. For example, the output of JIT at the […]

I believe this would also allow us to cleanly bundle how to compile/use a design together with the design, rather than baking that into the JIT code. Decoupling the user (JIT) from the provider of the design, and encapsulating how to build the design with the design, would also allow other flows to more easily reuse existing operators. Another interesting use case would be deployment. If running the JIT […]

Again, sorry for commenting something a little beside the point of this PR, and my apologies if this has already been discussed. But please do let me know if anyone has thoughts on this or counterpoints on where the suggested approach would fail that the current approach handles more carefully. I'm probably not considering all angles here.
Calling CMake from Python is an interesting approach, but creating CMakeLists on the fly is going to be a pain. There'd also be a lot of steps that can fail: CMake needs to configure, generate, and build, and each stage can fail for different reasons. If we consider different build systems, SCons or some other Python-based build system may be easier to integrate.

I can't envision how CMake can help with this. It's not great to begin with if you want to mix multiple languages, let alone integrate it as a JIT build system.
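To make the failure surface concrete, here is a sketch of driving those stages from Python with `subprocess` (paths and error handling are illustrative):

```python
import subprocess

def cmake_build(source_dir, build_dir):
    """Each CMake stage can fail independently, so each needs its own check."""
    steps = [
        # Configure + generate (a single invocation in modern CMake).
        ["cmake", "-S", source_dir, "-B", build_dir],
        # Build.
        ["cmake", "--build", build_dir],
    ]
    for cmd in steps:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"{' '.join(cmd)} failed:\n{result.stderr}")
```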
Yes, I'd be on board with considering other build systems as well. I'm not well versed in the space but just see a lot of overlap.
What we've done elsewhere is create a CMake function like […]
I've done this as well and it was a maintenance burden. There's no way to express a contract (API, parameter names, etc.) when you cross the boundary between a language and CMake. E.g., pybind11 allows you to create and name parameters in C++ and pass them to a Python function; there's no such capability for CMake integration, so this would need to be added and may not be trivial. Using CMake as the JIT manager creates a latency issue too, if it's the source of truth for detecting whether a JIT'd function is up to date.
Bumping this; do we have any way forward with this? I was trying to use the following:

```python
from os import path

import numpy as np

# Note: ExternalFunction, dtype_to_str, mlir_mod_ctx, and the design entry
# point mat_mul_fn (plus dev, A, B, C) are defined elsewhere in the project.


def create_mat_mul_external_functions(
    arch: str,
    input_tensors: list,
    output_tensor,
):
    m = 8
    n = 8
    k = 8
    use_scalar = False
    scalar_suffix = "_scalar" if use_scalar else ""
    num_cols = None
    if arch == "aie2":
        num_cols = 4
    elif arch == "aie2p":
        num_cols = 8
    else:
        raise ValueError(f"Unsupported architecture: {arch}")
    current_dir = path.dirname(path.realpath(__file__))
    source_file = path.join(current_dir, arch, "mm.cc")
    compile_args = [
        f"-DDIM_M={m}",
        f"-DDIM_N={n}",
        f"-DDIM_K={k}",
        f"-D{dtype_to_str(input_tensors[0].dtype)}_{dtype_to_str(output_tensor.dtype)}_ONLY",
        "-DB_COL_MAJ",
        "-DC_COL_MAJ",
    ]
    object_file_name = "matmul_core_functions.o"
    zero_fn = ExternalFunction(
        name=f"zero{scalar_suffix}_{dtype_to_str(output_tensor.dtype)}",
        object_file_name=object_file_name,
        source_file=source_file,
        arg_types=[np.ndarray[(m, n), np.dtype[output_tensor.dtype]]],
        compile_flags=compile_args,
    )
    matmul_fn = ExternalFunction(
        name=f"matmul{scalar_suffix}_{dtype_to_str(input_tensors[0].dtype)}_{dtype_to_str(output_tensor.dtype)}",
        object_file_name=object_file_name,
        source_file=source_file,
        arg_types=[
            np.ndarray[(m, k), np.dtype[input_tensors[0].dtype]],
            np.ndarray[(k, n), np.dtype[input_tensors[0].dtype]],
            np.ndarray[(m, n), np.dtype[output_tensor.dtype]],
        ],
        compile_flags=compile_args,
    )
    return (
        m,
        n,
        k,
        use_scalar,
        num_cols,
        zero_fn,
        matmul_fn,
    )


def my_matmul(arch: str, input_tensors: list, output_tensor):
    m, n, k, use_scalar, num_cols, zero_fn, matmul_fn = (
        create_mat_mul_external_functions(
            arch=arch, input_tensors=input_tensors, output_tensor=output_tensor
        )
    )
    with mlir_mod_ctx() as ctx:
        mat_mul_fn(
            dev=dev,
            M=A.shape[1],
            N=B.shape[1],
            K=A.shape[0],
            m=m,
            n=n,
            k=k,
            n_aie_cols=num_cols,
            dtype_in_str=dtype_to_str(A.dtype),
            dtype_out_str=dtype_to_str(C.dtype),
            b_col_maj=True,
            c_col_maj=True,
            use_scalar=use_scalar,
            emulate_bf16_mmul_with_bfp16=False,
            trace_size=0,
            zero_fn=zero_fn._name,
            matmul_fn=matmul_fn._name,
            object_file=matmul_fn.bin_name,
        )
        return ctx.module
```
Bumping this as well. What's the latest status on this? @mawad-amd @andrej @ypapadop-amd
Today if we write:
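(A representative sketch with hypothetical kernel names; the point is the same source file with identical flags.)

```python
import numpy as np

# Two ExternalFunctions created from the same source with the same flags
# (hypothetical names, following the examples earlier in this thread).
kernel_a = ExternalFunction(
    "kernel_a",
    source_file="kernels.cc",
    arg_types=[np.ndarray[(16,), np.dtype[np.int32]]],
    compile_flags=["-DSIZE=16"],
)
kernel_b = ExternalFunction(
    "kernel_b",
    source_file="kernels.cc",        # same source...
    arg_types=[np.ndarray[(16,), np.dtype[np.int32]]],
    compile_flags=["-DSIZE=16"],     # ...same flags
)
```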
The two ExternalFunctions will be associated with two different code objects, which doesn't make sense: the source code is the same and the compiler flags are the same, so we should be smart enough to detect that we need a single code object. This PR adds a cache so we can support multiple kernels from the same object without much overhead or extra lines of code.