@Sentimentron
Contributor

Implements a RGBA version of the Avg filter with portable_simd intrinsics.

| CPU | Baseline | Result | Speedup |
| --- | --- | --- | --- |
| Arm Cortex A520 | 415.9 MiB/s | 707.4 MiB/s | 70.08% |
| Arm Cortex X4 | 2053.9 MiB/s | 2334.9 MiB/s | 13.68% |
| Apple Silicon M2 | 2053.9 MiB/s | 2173.5 MiB/s | 3.62% |
| AMD EPYC 7B13 | 2425.8 MiB/s | 2150.8 MiB/s | -11.34% |

Marked as draft until #632 is completed.

@Sentimentron
Contributor Author

AI disclosure: I wrote an original sliding-window portable_simd implementation of the Paeth filter (3bpp) and optimized it for best performance on the Cortex A520. I then used the Gemini family of LLMs provided by my employer to automatically adapt this code to the Avg filter from a written description, and then to optimize it for the best possible code generation and performance across all other micro-architectures in simulation. This PR is derived from that output, but includes documentation and other cleanups.

@okaneco
Contributor

okaneco commented Sep 12, 2025

There's another 4bpp case for the first row, where previous.is_empty(). Not sure if you've tried that already.

image-png/src/filter.rs

Lines 612 to 624 in f33b850

    BytesPerPixel::Four => {
        let mut prev = [0; 4];
        for chunk in current.chunks_exact_mut(4) {
            let new_chunk = [
                chunk[0].wrapping_add(prev[0] / 2),
                chunk[1].wrapping_add(prev[1] / 2),
                chunk[2].wrapping_add(prev[2] / 2),
                chunk[3].wrapping_add(prev[3] / 2),
            ];
            *TryInto::<&mut [u8; 4]>::try_into(chunk).unwrap() = new_chunk;
            prev = new_chunk;
        }
    }

Again, the Cortex-A520 seems to be the big winner here, going from 415 MiB/s to about 700 MiB/s (70% faster); the X4 benefits less.

@Sentimentron
Contributor Author

I hadn't tried it - wrote some quick code for it but it seems that the unfilter benchmark doesn't test this edge case... 🤔
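One way to cover the first-row case without a dedicated benchmark is a small equivalence check: with no previous row, the Avg predictor reduces to `left / 2`, which is the same as running the general path against an all-zero previous row. The sketch below is a scalar illustration only, not the code in this PR; the function names are hypothetical and do not exist in image-png.

```rust
/// First-row case (no previous row), modelled on the snippet quoted above:
/// each byte is reconstructed from half of the byte one pixel to the left.
fn unfilter_avg_first_row_4bpp(current: &mut [u8]) {
    let mut prev = [0u8; 4];
    for chunk in current.chunks_exact_mut(4) {
        for i in 0..4 {
            chunk[i] = chunk[i].wrapping_add(prev[i] / 2);
        }
        prev.copy_from_slice(chunk);
    }
}

/// General scalar case: the predictor is floor((left + above) / 2).
/// The sum is widened to u16 so it cannot overflow before halving.
fn unfilter_avg_4bpp_scalar(previous: &[u8], current: &mut [u8]) {
    assert_eq!(previous.len(), current.len());
    let mut left = [0u8; 4];
    for (cur, above) in current.chunks_exact_mut(4).zip(previous.chunks_exact(4)) {
        for i in 0..4 {
            let avg = ((left[i] as u16 + above[i] as u16) / 2) as u8;
            cur[i] = cur[i].wrapping_add(avg);
            left[i] = cur[i];
        }
    }
}

/// Hypothetical helper: returns true when the first-row path and the general
/// path with an all-zero previous row reconstruct the same bytes.
fn first_row_matches_zero_previous(filtered: &[u8]) -> bool {
    let mut first_row = filtered.to_vec();
    unfilter_avg_first_row_4bpp(&mut first_row);

    let zeros = vec![0u8; filtered.len()];
    let mut general = filtered.to_vec();
    unfilter_avg_4bpp_scalar(&zeros, &mut general);

    first_row == general
}
```

A SIMD version of the first-row path could be compared against either scalar function in the same way, which would catch mismatches even while the unfilter benchmark does not exercise this branch.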

@Sentimentron
Contributor Author

Also, if any contributors have access to some Intel hardware, could they give this portable_simd version a try? (Otherwise I'll cfg-gate it off in a subsequent version to avoid the AMD EPYC 7B13 regression.)
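As a rough sketch of what such cfg-gating could look like: the choice of `target_arch = "aarch64"` as the gate is an assumption based on the table above (the Arm and Apple cores gain, the x86 part regresses), and all function names are hypothetical. The aarch64 body is a placeholder that delegates to the scalar code so the sketch compiles on stable; the real version would call the portable_simd kernel there.

```rust
/// Scalar fallback: undo the Avg filter for 4 bytes per pixel.
fn unfilter_avg_4bpp_scalar(previous: &[u8], current: &mut [u8]) {
    assert_eq!(previous.len(), current.len());
    let mut left = [0u8; 4];
    for (cur, above) in current.chunks_exact_mut(4).zip(previous.chunks_exact(4)) {
        for i in 0..4 {
            // Widen to u16 so left + above cannot overflow before halving.
            let avg = ((left[i] as u16 + above[i] as u16) / 2) as u8;
            cur[i] = cur[i].wrapping_add(avg);
            left[i] = cur[i];
        }
    }
}

/// Compiled only on aarch64 (assumed gate). Placeholder body: the SIMD
/// kernel would be called here instead of the scalar fallback.
#[cfg(target_arch = "aarch64")]
fn unfilter_avg_4bpp(previous: &[u8], current: &mut [u8]) {
    unfilter_avg_4bpp_scalar(previous, current);
}

/// Every other target keeps the existing scalar path, avoiding the
/// regression measured on x86.
#[cfg(not(target_arch = "aarch64"))]
fn unfilter_avg_4bpp(previous: &[u8], current: &mut [u8]) {
    unfilter_avg_4bpp_scalar(previous, current);
}
```

Gating at compile time keeps the call site unchanged on every target; whether Intel parts should be included would depend on the measurements requested above.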
