portable_simd version of Avg (4bpp) #641

Sentimentron · 2025-09-09T18:00:04Z

Implements an RGBA version of the Avg filter with portable_simd intrinsics.
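For orientation, here is a minimal scalar sketch of what reversing the Avg filter means for an RGBA (4bpp) row: each byte gets back the floor of the average of the reconstructed byte one pixel to the left and the byte directly above, modulo 256. This is not the code in this PR. The name `unfilter_avg_rgba` is hypothetical, and the portable_simd version is not reproduced here.

```rust
/// Scalar reference for reversing the PNG Avg filter on one RGBA8 row.
/// `previous` is the already reconstructed row above; `current` holds the
/// filtered bytes and is overwritten with the reconstructed bytes.
/// (Hypothetical helper for illustration, not the crate's API.)
fn unfilter_avg_rgba(previous: &[u8], current: &mut [u8]) {
    assert_eq!(previous.len(), current.len());
    // Reconstructed pixel to the left; zero for the first pixel in the row.
    let mut left = [0u8; 4];
    for (cur, above) in current.chunks_exact_mut(4).zip(previous.chunks_exact(4)) {
        for i in 0..4 {
            // Widen to u16 so `left + above` cannot overflow before halving.
            let avg = ((left[i] as u16 + above[i] as u16) / 2) as u8;
            cur[i] = cur[i].wrapping_add(avg);
            left[i] = cur[i];
        }
    }
}
```

The dependency of each pixel on the reconstructed pixel to its left is what makes this filter awkward to vectorise across pixels, so a SIMD version would typically work across the four channels of one pixel at a time.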

CPU	Baseline	Result	Speedup
Arm Cortex A520	415.9 MiB/s	707.4 MiB/s	70.08%
Arm Cortex X4	2053.9 MiB/s	2334.9 MiB/s	13.68%
Apple Silicon M2	2053.9 MiB/s	2173.5 MiB/s	3.62%
AMD EPYC 7B13	2425.8 MiB/s	2150.8 MiB/s	-11.34%

Marked as draft until #632 is completed.

Sentimentron · 2025-09-09T18:58:01Z

AI disclosure: I wrote an original sliding-window portable_simd implementation of the Paeth filter (3bpp) and optimized it for best performance on the Cortex A520. I then used the Gemini family of LLMs provided by my employer to automatically adapt this code to the Avg filter from a written description, and then to optimize it for the best possible code generation and performance across all other micro-architectures in simulation. This PR is derived from that output, but includes documentation and other cleanups.

okaneco · 2025-09-12T19:40:18Z

There's another 4bpp case for the first row, where previous.is_empty(), not sure if you've tried that already.

image-png/src/filter.rs

Lines 612 to 624 in f33b850

    
    BytesPerPixel::Four => {
        let mut prev = [0; 4];
        for chunk in current.chunks_exact_mut(4) {
            let new_chunk = [
                chunk[0].wrapping_add(prev[0] / 2),
                chunk[1].wrapping_add(prev[1] / 2),
                chunk[2].wrapping_add(prev[2] / 2),
                chunk[3].wrapping_add(prev[3] / 2),
            ];
            *TryInto::<&mut [u8; 4]>::try_into(chunk).unwrap() = new_chunk;
            prev = new_chunk;
        }
    }

Sentimentron · 2025-09-12T20:16:44Z

I hadn't tried it - wrote some quick code for it but it seems that the unfilter benchmark doesn't test this edge case... 🤔
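Since the benchmark does not reach the first-row path, a small correctness check could cover it instead. The following is only a sketch under stated assumptions: `avg_first_row_reference` mirrors the scalar arm quoted above, while `test_row` and `first_row_agrees` are hypothetical helpers that do not exist in the crate. Any candidate implementation (for example a SIMD one) would be passed in as a plain function pointer.

```rust
/// Scalar reference for the first row, equivalent to the quoted
/// `BytesPerPixel::Four` arm: only the left pixel contributes.
fn avg_first_row_reference(current: &mut [u8]) {
    let mut prev = [0u8; 4];
    for chunk in current.chunks_exact_mut(4) {
        for i in 0..4 {
            chunk[i] = chunk[i].wrapping_add(prev[i] / 2);
            prev[i] = chunk[i];
        }
    }
}

/// Deterministic byte pattern so the check needs no external crates.
/// (Hypothetical helper.)
fn test_row(len: usize) -> Vec<u8> {
    (0..len).map(|i| ((i * 37 + 11) % 256) as u8).collect()
}

/// Returns true if `candidate` reproduces the scalar reference on rows of
/// several widths, including empty and single-pixel rows.
/// (Hypothetical helper.)
fn first_row_agrees(candidate: fn(&mut [u8])) -> bool {
    [0usize, 1, 2, 3, 7, 64].iter().all(|&pixels| {
        let filtered = test_row(pixels * 4);

        let mut expected = filtered.clone();
        avg_first_row_reference(&mut expected);

        let mut actual = filtered;
        candidate(&mut actual);

        expected == actual
    })
}
```

A check like this verifies correctness only; it says nothing about throughput, so the first-row case would still need its own benchmark to confirm any speedup.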

Sentimentron · 2025-09-12T20:19:21Z

Also, if any contributors have access to Intel hardware, could they give this portable_simd version a try? Otherwise I'll cfg-gate it off in a subsequent version to avoid the AMD EPYC 7B13 regression.
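As a rough illustration of what cfg-gating could look like, the sketch below selects a path at compile time by target architecture. Everything here is an assumption for illustration: the function names are hypothetical, the choice of `target_arch = "aarch64"` as the gate is a guess, and the "fast" path simply calls the scalar code as a stand-in because portable_simd requires a nightly toolchain. Note that `cfg` works at compile time, so it can separate Arm from x86_64 but cannot tell AMD from Intel.

```rust
/// Scalar first-row Avg unfilter for RGBA8 (same arithmetic as the quoted arm).
fn avg_first_row_scalar(current: &mut [u8]) {
    let mut prev = [0u8; 4];
    for chunk in current.chunks_exact_mut(4) {
        for i in 0..4 {
            chunk[i] = chunk[i].wrapping_add(prev[i] / 2);
            prev[i] = chunk[i];
        }
    }
}

/// Stand-in for the vectorised routine. Hypothetical: a real build would
/// call the portable_simd implementation here instead of the scalar code.
#[cfg(target_arch = "aarch64")]
fn avg_first_row_fast(current: &mut [u8]) {
    avg_first_row_scalar(current)
}

/// On aarch64, use the (stand-in) fast path.
#[cfg(target_arch = "aarch64")]
fn avg_first_row(current: &mut [u8]) {
    avg_first_row_fast(current)
}

/// Everywhere else, including x86_64, keep the scalar path.
#[cfg(not(target_arch = "aarch64"))]
fn avg_first_row(current: &mut [u8]) {
    avg_first_row_scalar(current)
}
```

Callers use `avg_first_row` unconditionally and get identical output on every target; only the compiled-in implementation differs.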

perf: avg filter (4bpp)

096960b

Again, Cortex-A520 seems the big winner here, going from 415 MiB/s to about 700 MiB/s (70% faster), X4 benefits less.

Sentimentron force-pushed the portable_simd-avg-bpp4 branch from ac6b46c to 096960b on September 12, 2025 20:17
