Implement matmul_veclib() with cblas#800
Conversation
20023ff to
c91cecd
Compare
|
Please convert this back to draft before leaving inline annotation and have linter clean. |
54201f8 to
a05324b
Compare
ThreeMonth03
left a comment
There was a problem hiding this comment.
@yungyuc Please take a look.
| /** | ||
| * Perform matrix multiplication using Accelerate/CBLAS when available. | ||
| */ | ||
| template <typename A, typename T> | ||
| A SimpleArrayMatmulHelper<A, T>::matmul_veclib() | ||
| { | ||
| if (m_lhs.ndim() == 1 && m_rhs.ndim() == 1) | ||
| { | ||
| return matmul_vec_vec(); | ||
| } | ||
| if (m_lhs.ndim() == 1) | ||
| { | ||
| return matmul_vec_mat(); | ||
| } | ||
| if (m_rhs.ndim() == 1) | ||
| { | ||
| return matmul_mat_vec(); | ||
| } | ||
|
|
||
| return matmul_mat_mat_veclib(); | ||
| } | ||
|
|
There was a problem hiding this comment.
Create new api interface.
| /** | ||
| * Perform matrix multiplication using Accelerate/CBLAS when available. | ||
| */ | ||
| template <typename A, typename T> | ||
| A SimpleArrayMatmulHelper<A, T>::matmul_veclib() | ||
| { | ||
| if (m_lhs.ndim() == 1 && m_rhs.ndim() == 1) | ||
| { | ||
| return matmul_vec_vec(); | ||
| } | ||
| if (m_lhs.ndim() == 1) | ||
| { | ||
| return matmul_vec_mat(); | ||
| } | ||
| if (m_rhs.ndim() == 1) | ||
| { | ||
| return matmul_mat_vec(); | ||
| } | ||
|
|
||
| return matmul_mat_mat_veclib(); | ||
| } | ||
|
|
There was a problem hiding this comment.
Create new api interface.
| template <typename A, typename T> | ||
| A SimpleArrayMatmulHelper<A, T>::matmul_mat_mat_veclib() | ||
| { | ||
| if (!m_lhs.is_c_contiguous() || !m_rhs.is_c_contiguous()) | ||
| { | ||
| return matmul_mat_mat(); | ||
| } | ||
|
|
||
| size_t const m = m_result.shape(0); | ||
| size_t const n = m_result.shape(1); | ||
| size_t const k = m_lhs.shape(1); | ||
| simd::matmul(m, n, k, m_lhs.data(), m_rhs.data(), m_result.data()); | ||
| return std::move(m_result); | ||
| } | ||
|
|
There was a problem hiding this comment.
Only consider accelerate library when SimpleArray are c-contiguous.
| template <typename T> | ||
| inline constexpr bool is_std_complex_layout_compatible_v = std::is_standard_layout_v<Complex<T>> && | ||
| sizeof(Complex<T>) == sizeof(std::complex<T>) && | ||
| alignof(Complex<T>) == alignof(std::complex<T>); | ||
|
|
||
| static_assert(is_std_complex_layout_compatible_v<float>); | ||
| static_assert(is_std_complex_layout_compatible_v<double>); | ||
|
|
||
| template <typename T> | ||
| std::complex<T> const * as_std_complex_pointer(Complex<T> const * ptr) | ||
| { | ||
| return reinterpret_cast<std::complex<T> const *>(ptr); // NOLINT(cppcoreguidelines-pro-type-reinterpret-cast) | ||
| } | ||
|
|
||
| template <typename T> | ||
| std::complex<T> * as_std_complex_pointer(Complex<T> * ptr) | ||
| { | ||
| return reinterpret_cast<std::complex<T> *>(ptr); // NOLINT(cppcoreguidelines-pro-type-reinterpret-cast) | ||
| } |
There was a problem hiding this comment.
In order to support matrix multiplication of Complex, it is necessary to ensure the layout of std::complex<T> and Complex<T> are aligned.
Additionally, it is allowed to reinterpret cast from Complex<T> to std::complex<T>.
There was a problem hiding this comment.
Discussion: it sounds like that our own Complex is unnecessary and may directly use std::complex?
@j8xixo12 , what do you think?
There was a problem hiding this comment.
I think we can directly use std::complex, it should reduce the maintaining efforts.
There was a problem hiding this comment.
The difference of api between std::complex and Complex is comparison operators. We implement comparison operators based on lexicographic order in Complex, but there are not any comparison operators in std::complex.
Without comparison operators, some apis, like min() or max() will not support complex type. Therefore, we might create some helper functions if Complex is substituted.
There was a problem hiding this comment.
Right, having Complex allows us customize operators without worrying about compatibility to other code that uses the STL std::complex. It is a big win.
Let's keep our Complex for a while.
|
|
||
| #include <modmesh/simd/accelerate/accelerate.hpp> | ||
|
|
||
| #if defined(__APPLE__) && defined(__arm64__) |
There was a problem hiding this comment.
This file would be compiled if the device is macos with arm64 architecture.
There was a problem hiding this comment.
It is not right to put it under simd/ because we did not add SIMD code for it. It is a vendor library. Use a distinct file under buffer/ for it.
| enum class GemmBackend : uint8_t | ||
| { | ||
| Generic, | ||
| Accelerate, | ||
| }; | ||
|
|
||
| template <typename T> | ||
| inline constexpr GemmBackend gemm_backend_v = accelerate::supports_matmul_v<T> | ||
| ? GemmBackend::Accelerate | ||
| : GemmBackend::Generic; | ||
|
|
||
| } /* namespace detail */ | ||
|
|
||
| template <typename T> | ||
| void matmul(size_t m, size_t n, size_t k, T const * lhs, T const * rhs, T * result) | ||
| { | ||
| if constexpr (detail::gemm_backend_v<T> == detail::GemmBackend::Accelerate) | ||
| { | ||
| detail::accelerate::matmul(m, n, k, lhs, rhs, result); | ||
| } | ||
| else | ||
| { | ||
| generic::matmul(m, n, k, lhs, rhs, result); | ||
| } | ||
| } |
There was a problem hiding this comment.
The function would determine the backend.
| (std::is_same_v<T, float> || | ||
| std::is_same_v<T, double> || | ||
| std::is_same_v<T, Complex<float>> || | ||
| std::is_same_v<T, Complex<double>>); |
There was a problem hiding this comment.
cblas only supports these type.
There was a problem hiding this comment.
Yes. They are what we need for now.
| ${CMAKE_CURRENT_SOURCE_DIR}/gemm.hpp | ||
| ${CMAKE_CURRENT_SOURCE_DIR}/gemm_generic.hpp | ||
| ${CMAKE_CURRENT_SOURCE_DIR}/accelerate/accelerate.hpp | ||
| CACHE FILEPATH "" FORCE) | ||
|
|
||
| set(MODMESH_SIMD_SOURCES | ||
| ${CMAKE_CURRENT_SOURCE_DIR}/simd_support.cpp | ||
| ${CMAKE_CURRENT_SOURCE_DIR}/accelerate/accelerate.cpp |
There was a problem hiding this comment.
The pathnames and filenames are unreasonable, but I have to idea where to put these files.
There was a problem hiding this comment.
Test the correctness of matmul_veclib().
There was a problem hiding this comment.
Profile matmul_veclib().
|
By the way, I decide to check the latency of profiler if I'm available, because the profiling result looks weird when numpy array is small. If there is any update, I will make a pull request. |
yungyuc
left a comment
There was a problem hiding this comment.
- Keep the wrappers
matmul,matmul_veclib, andmatmul_fastclose. - Discussion: Maybe
imatmul_fastshould also be wrapped to Python? - Discussion: Do we need our own
Complexor should directly usestd::complex? - Use a distinct file under
buffer/for the veclib matmul wrapper. - Add a FIXME and a follow-up issue to track fixing silently passed failures for missing veclib.
| { return self.div(scalar); }) | ||
| .def("matmul", &wrapped_type::matmul) | ||
| .def("__matmul__", &wrapped_type::matmul) | ||
| .def("matmul_veclib", &wrapped_type::matmul_veclib) |
There was a problem hiding this comment.
Good name (matmul_veclib). It should be placed after matmul, and fast_matmul should be renamed as matmul_fast for consistent names.
| { self.idiv(scalar); }) | ||
| .def("imatmul", [](wrapped_type & self, wrapped_type const & other) | ||
| { self.imatmul(other); }) | ||
| .def("imatmul_veclib", [](wrapped_type & self, wrapped_type const & other) |
There was a problem hiding this comment.
Discussion: Maybe imatmul_fast should also be wrapped to Python?
There was a problem hiding this comment.
Yes, it should be wrapped to Python. We have implemented this api.
| template <typename A, typename T> | ||
| A SimpleArrayMatmulHelper<A, T>::matmul_mat_mat_veclib() | ||
| { | ||
| if (!m_lhs.is_c_contiguous() || !m_rhs.is_c_contiguous()) | ||
| { | ||
| return matmul_mat_mat(); | ||
| } | ||
|
|
||
| size_t const m = m_result.shape(0); | ||
| size_t const n = m_result.shape(1); | ||
| size_t const k = m_lhs.shape(1); | ||
| simd::matmul(m, n, k, m_lhs.data(), m_rhs.data(), m_result.data()); | ||
| return std::move(m_result); | ||
| } | ||
|
|
| template <typename T> | ||
| inline constexpr bool is_std_complex_layout_compatible_v = std::is_standard_layout_v<Complex<T>> && | ||
| sizeof(Complex<T>) == sizeof(std::complex<T>) && | ||
| alignof(Complex<T>) == alignof(std::complex<T>); | ||
|
|
||
| static_assert(is_std_complex_layout_compatible_v<float>); | ||
| static_assert(is_std_complex_layout_compatible_v<double>); | ||
|
|
||
| template <typename T> | ||
| std::complex<T> const * as_std_complex_pointer(Complex<T> const * ptr) | ||
| { | ||
| return reinterpret_cast<std::complex<T> const *>(ptr); // NOLINT(cppcoreguidelines-pro-type-reinterpret-cast) | ||
| } | ||
|
|
||
| template <typename T> | ||
| std::complex<T> * as_std_complex_pointer(Complex<T> * ptr) | ||
| { | ||
| return reinterpret_cast<std::complex<T> *>(ptr); // NOLINT(cppcoreguidelines-pro-type-reinterpret-cast) | ||
| } |
There was a problem hiding this comment.
Discussion: it sounds like that our own Complex is unnecessary and may directly use std::complex?
@j8xixo12 , what do you think?
|
|
||
| #include <modmesh/simd/accelerate/accelerate.hpp> | ||
|
|
||
| #if defined(__APPLE__) && defined(__arm64__) |
There was a problem hiding this comment.
It is not right to put it under simd/ because we did not add SIMD code for it. It is a vendor library. Use a distinct file under buffer/ for it.
| (std::is_same_v<T, float> || | ||
| std::is_same_v<T, double> || | ||
| std::is_same_v<T, Complex<float>> || | ||
| std::is_same_v<T, Complex<double>>); |
There was a problem hiding this comment.
Yes. They are what we need for now.
There was a problem hiding this comment.
Files you added under simd/ should all go to buffer/.
| try: | ||
| veclib_result = lhs.matmul_veclib(rhs) | ||
| except RuntimeError as exc: | ||
| self.assertEqual(str(exc), veclib_unavailable) |
There was a problem hiding this comment.
This is tricky. Failures are silently passed. When there is not veclib the tests should be marked with expected failures, but non-veclib tests should pass. That is, we will need distinct test functions only for veclib.
Please add a FIXME here and create a follow-up issue to track the fix.
ac1981f to
188ce0e
Compare
ThreeMonth03
left a comment
There was a problem hiding this comment.
- Keep the wrappers matmul, matmul_veclib, and matmul_fast close.
- Discussion: Maybe imatmul_fast should also be wrapped to Python?
- Discussion: Do we need our own Complex or should directly use std::complex?
- Use a distinct file under buffer/ for the veclib matmul wrapper.
- Add a FIXME and a follow-up issue to track fixing silently passed failures for missing veclib.
@yungyuc Please take a look. Thanks.
| { self.idiv(scalar); }) | ||
| .def("imatmul", [](wrapped_type & self, wrapped_type const & other) | ||
| { self.imatmul(other); }) | ||
| .def("imatmul_veclib", [](wrapped_type & self, wrapped_type const & other) |
There was a problem hiding this comment.
Yes, it should be wrapped to Python. We have implemented this api.
| .def("matmul", &wrapped_type::matmul) | ||
| .def("__matmul__", &wrapped_type::matmul) | ||
| .def("matmul_veclib", &wrapped_type::matmul_veclib) | ||
| .def( | ||
| "fast_matmul", | ||
| "matmul_fast", |
There was a problem hiding this comment.
Rename matmul_fast and replace apis.
| .def( | ||
| "imatmul_fast", | ||
| [](wrapped_type & self, | ||
| wrapped_type const & other, | ||
| size_t tile_x, | ||
| size_t tile_y, | ||
| size_t tile_z) | ||
| { self.imatmul_fast(other, tile_x, tile_y, tile_z); }, | ||
| py::arg("other"), | ||
| py::arg("tile_x") = 16, | ||
| py::arg("tile_y") = 16, | ||
| py::arg("tile_z") = 16) |
There was a problem hiding this comment.
I wrapped imatmul_fast to python now.
There was a problem hiding this comment.
Move helper class SimpleArrayMatmulHelper to matmul.hpp.
| except RuntimeError as exc: | ||
| # FIXME: Split veclib backend coverage into dedicated tests and | ||
| # mark unsupported platforms as expected failures once a follow-up | ||
| # issue is filed. | ||
| self.assertEqual(str(exc), veclib_unavailable) | ||
| return |
There was a problem hiding this comment.
Add FIXME comment temporaily. I would open a new issue later.
There was a problem hiding this comment.
By the way, matmul_veclib has been modified. It would be acted as matmul if it is not used on apple platform. I'm not sure whether we need try/except now.
188ce0e to
4f467d9
Compare
4f467d9 to
663a693
Compare
As issue #789 mentioned, we need to implement and benchmark the speed of
matmul_veclib(), which uses accelerate framework. Therefore, this pull request create new apimatmul_veclib()and profile the operation time.The following chart and sheet are profiling results. It shows that the speed of


matmul_veclib()outperform numpy library.2D x 2D shape: (4, 4) x (4, 4) dtype:
float322D x 2D shape: (16, 16) x (16, 16) dtype:
float322D x 2D shape: (64, 64) x (64, 64) dtype:
float322D x 2D shape: (256, 256) x (256, 256) dtype:
float322D x 2D shape: (1024, 1024) x (1024, 1024) dtype:
float322D x 2D shape: (4, 4) x (4, 4) dtype:
float642D x 2D shape: (16, 16) x (16, 16) dtype:
float642D x 2D shape: (64, 64) x (64, 64) dtype:
float642D x 2D shape: (256, 256) x (256, 256) dtype:
float642D x 2D shape: (1024, 1024) x (1024, 1024) dtype:
float642D x 2D shape: (9, 9) x (9, 9) dtype:
float322D x 2D shape: (27, 27) x (27, 27) dtype:
float322D x 2D shape: (81, 81) x (81, 81) dtype:
float322D x 2D shape: (243, 243) x (243, 243) dtype:
float322D x 2D shape: (729, 729) x (729, 729) dtype:
float322D x 2D shape: (9, 9) x (9, 9) dtype:
float642D x 2D shape: (27, 27) x (27, 27) dtype:
float642D x 2D shape: (81, 81) x (81, 81) dtype:
float642D x 2D shape: (243, 243) x (243, 243) dtype:
float642D x 2D shape: (729, 729) x (729, 729) dtype:
float64