Skip to content

Conversation

@197g
Copy link
Member

@197g 197g commented Sep 4, 2025

Closes: #639

With the in-place unfilter buffer we forced unfiltering to trail behind the deflate window whereas previously it could copy data that was still used for back-reference and filter on it. That resulted in better performance but for partial images we lost out on 32kB of available image data. This relaxes the UnfilteringBuffer to allow both: use in-place for the common path but add a fallback for unfiltering the back reference portion of the data when we hit UnexpectedEof.

Performance with this patch is mostly good but I'm hitting some erratic, weird fluctuations on the noncompressed tests (slower) and paletted-zune (faster) and Exoplanet_Phase_Curve_(Diagram).png (also much faster??). I'd very much like to understand the individual causes here given that we do just two extra bounds checks and an integer-min that shouldn't be in extremely tight loops.

197g added 2 commits September 3, 2025 23:47
This relaxes the constraints that the regions between available..filled
and the actual rows that are worked on do not overlap. Instead,
different methods are used depending which part is worked on. The
obligation of avoiding a mutation of the back reference buffer is on the
caller to decide (we have debug_asserts in place) which had already been
the case since the usual unfilter_row method would not verify that
enough space was available.

There's a few points that offer design decisions:

- Previously we had the implicit invariant that the zlib window must
  start at `0` if it was not full yet, since the decoder determines the
  back-reference range by subtracting the maximum possible window.
  Without any compensation we can not place a mutable row region at the
  start of the data buffer before it is full, otherwise we would change
  the read-only window *backwards* violating invariants.

  Since a window that is too big isn't a problem we could ignore that
  and move data buffer by a large amount backwards if we establish that
  region.

  Alternatively, any commit to the `available` field can enforce that it
  is only updated forwards (i.e. it only moves back with the data).

- The `prev_row` index was used to determine the amount of available
  shift back but it can now be within the region if an interlace reset
  occurs while we do read-ahead of rows (we still use the check
  `prev_row == current_row` to distinguish the first row of each
  interlace pass).

  So the shift back is changed to use the minimum of `prev_row` and
  `available` which is equivalent to the previous for a shift back after
  `reset` in all existing cases but also works if we shift back after
  having read into the deflate buffer region.
@197g
Copy link
Member Author

197g commented Sep 4, 2025

Performance seems to have been an artifact of the known problems llvm. Here's the results from Rust 1.89 locally:

test result: ok. 0 passed; 0 failed; 82 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/decoder.rs (target/release/deps/decoder-bbb4e9ee3bf79167)
decode/Lohengrin_-_Illustrated_Sporting_and_Dramatic_News.png
                        time:   [13.993 ms 14.008 ms 14.026 ms]
                        thrpt:  [761.35 MiB/s 762.31 MiB/s 763.14 MiB/s]
                 change:
                        time:   [+0.3531% +0.5748% +0.8034%] (p = 0.00 < 0.05)
                        thrpt:  [−0.7970% −0.5715% −0.3518%]
                        Change within noise threshold.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high severe
decode/Transparency.png time:   [48.908 µs 48.935 µs 49.007 µs]
                        thrpt:  [6.8414 GiB/s 6.8514 GiB/s 6.8553 GiB/s]
                 change:
                        time:   [+0.1524% +0.2277% +0.3460%] (p = 0.00 < 0.05)
                        thrpt:  [−0.3448% −0.2272% −0.1522%]
                        Change within noise threshold.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) high mild
  1 (10.00%) high severe
decode/kodim02.png      time:   [2.0754 ms 2.0758 ms 2.0765 ms]
                        thrpt:  [541.79 MiB/s 541.97 MiB/s 542.07 MiB/s]
                 change:
                        time:   [−3.9275% −2.3384% −1.3331%] (p = 0.00 < 0.05)
                        thrpt:  [+1.3511% +2.3944% +4.0881%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
decode/kodim07.png      time:   [2.5150 ms 2.5179 ms 2.5218 ms]
                        thrpt:  [446.11 MiB/s 446.80 MiB/s 447.31 MiB/s]
                 change:
                        time:   [+0.1683% +0.2903% +0.4104%] (p = 0.00 < 0.05)
                        thrpt:  [−0.4087% −0.2894% −0.1680%]
                        Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
decode/kodim17.png      time:   [2.0609 ms 2.0614 ms 2.0621 ms]
                        thrpt:  [545.57 MiB/s 545.74 MiB/s 545.88 MiB/s]
                 change:
                        time:   [+0.7159% +0.7368% +0.7580%] (p = 0.00 < 0.05)
                        thrpt:  [−0.7523% −0.7314% −0.7108%]
                        Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
decode/kodim23.png      time:   [1.9937 ms 1.9940 ms 1.9942 ms]
                        thrpt:  [564.15 MiB/s 564.20 MiB/s 564.27 MiB/s]
                 change:
                        time:   [+0.6497% +0.6881% +0.7242%] (p = 0.00 < 0.05)
                        thrpt:  [−0.7190% −0.6834% −0.6455%]
                        Change within noise threshold.
decode/Exoplanet_Phase_Curve_(Diagram).png
                        time:   [14.449 ms 14.462 ms 14.491 ms]
                        thrpt:  [1.5992 GiB/s 1.6024 GiB/s 1.6039 GiB/s]
                 change:
                        time:   [−1.5108% −1.0287% −0.6139%] (p = 0.00 < 0.05)
                        thrpt:  [+0.6177% +1.0394% +1.5340%]
                        Change within noise threshold.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) high mild
  1 (10.00%) high severe
decode/Exoplanet_Phase_Curve_(Diagram)_indexed_gimp.png
                        time:   [5.1186 ms 5.1201 ms 5.1210 ms]
                        thrpt:  [4.5253 GiB/s 4.5262 GiB/s 4.5275 GiB/s]
                 change:
                        time:   [+0.2168% +0.2882% +0.3530%] (p = 0.00 < 0.05)
                        thrpt:  [−0.3517% −0.2874% −0.2163%]
                        Change within noise threshold.
Found 3 outliers among 10 measurements (30.00%)
  1 (10.00%) low mild
  2 (20.00%) high mild
decode/Fantasy_Digital_Painting.png
                        time:   [12.341 ms 12.355 ms 12.371 ms]
                        thrpt:  [511.36 MiB/s 512.02 MiB/s 512.60 MiB/s]
                 change:
                        time:   [+0.9006% +1.0055% +1.1120%] (p = 0.00 < 0.05)
                        thrpt:  [−1.0998% −0.9954% −0.8926%]
                        Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
decode/lorem_ipsum_oxipng.png
                        time:   [432.52 µs 435.01 µs 438.94 µs]
                        thrpt:  [3.1781 GiB/s 3.2068 GiB/s 3.2253 GiB/s]
                 change:
                        time:   [+0.1332% +1.4542% +2.7337%] (p = 0.06 > 0.05)
                        thrpt:  [−2.6610% −1.4334% −0.1330%]
                        No change in performance detected.
decode/lorem_ipsum_screenshot.png
                        time:   [660.77 µs 661.09 µs 661.31 µs]
                        thrpt:  [2.8126 GiB/s 2.8135 GiB/s 2.8149 GiB/s]
                 change:
                        time:   [+0.6805% +0.7392% +0.7972%] (p = 0.00 < 0.05)
                        thrpt:  [−0.7909% −0.7338% −0.6759%]
                        Change within noise threshold.
decode/paletted-zune.png
                        time:   [6.0131 ms 6.0136 ms 6.0143 ms]
                        thrpt:  [2.1435 GiB/s 2.1438 GiB/s 2.1439 GiB/s]
                 change:
                        time:   [+1.0916% +1.1040% +1.1169%] (p = 0.00 < 0.05)
                        thrpt:  [−1.1046% −1.0919% −1.0799%]
                        Performance has regressed.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high mild
decode/tango-icon-address-book-new-128-rsvg-convert.png
                        time:   [56.076 µs 56.077 µs 56.079 µs]
                        thrpt:  [1.0884 GiB/s 1.0884 GiB/s 1.0884 GiB/s]
                 change:
                        time:   [−0.2360% −0.1647% −0.1115%] (p = 0.00 < 0.05)
                        thrpt:  [+0.1116% +0.1650% +0.2365%]
                        Change within noise threshold.
decode/tango-icon-address-book-new-16.png
                        time:   [4.7660 µs 4.7695 µs 4.7743 µs]
                        thrpt:  [204.55 MiB/s 204.75 MiB/s 204.90 MiB/s]
                 change:
                        time:   [+0.7646% +0.8426% +0.9170%] (p = 0.00 < 0.05)
                        thrpt:  [−0.9087% −0.8355% −0.7588%]
                        Change within noise threshold.
decode/tango-icon-address-book-new-32.png
                        time:   [8.0016 µs 8.0396 µs 8.1525 µs]
                        thrpt:  [479.15 MiB/s 485.88 MiB/s 488.18 MiB/s]
                 change:
                        time:   [−0.0613% +0.6959% +2.1059%] (p = 0.42 > 0.05)
                        thrpt:  [−2.0625% −0.6911% +0.0613%]
                        No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) high mild
  1 (10.00%) high severe

generated-noncompressed-4k-idat/8x8.png
                        time:   [685.14 ns 685.40 ns 685.71 ns]
                        thrpt:  [356.04 MiB/s 356.20 MiB/s 356.34 MiB/s]
                 change:
                        time:   [−0.0314% +0.0248% +0.0823%] (p = 0.41 > 0.05)
                        thrpt:  [−0.0822% −0.0248% +0.0314%]
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe
generated-noncompressed-4k-idat/128x128.png
                        time:   [8.9758 µs 8.9768 µs 8.9778 µs]
                        thrpt:  [6.7984 GiB/s 6.7992 GiB/s 6.8000 GiB/s]
                 change:
                        time:   [+2.4892% +2.6191% +2.7237%] (p = 0.00 < 0.05)
                        thrpt:  [−2.6514% −2.5523% −2.4288%]
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
generated-noncompressed-4k-idat/2048x2048.png
                        time:   [1.9937 ms 2.0081 ms 2.0235 ms]
                        thrpt:  [7.7219 GiB/s 7.7809 GiB/s 7.8372 GiB/s]
                 change:
                        time:   [−10.610% −6.6236% −2.8469%] (p = 0.00 < 0.05)
                        thrpt:  [+2.9303% +7.0935% +11.869%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
Benchmarking generated-noncompressed-4k-idat/12288x12288.png: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 5.5s or enable flat sampling.
generated-noncompressed-4k-idat/12288x12288.png
                        time:   [99.363 ms 99.636 ms 100.01 ms]
                        thrpt:  [5.6243 GiB/s 5.6455 GiB/s 5.6610 GiB/s]
                 change:
                        time:   [−0.2676% −0.0491% +0.1835%] (p = 0.71 > 0.05)
                        thrpt:  [−0.1832% +0.0491% +0.2683%]
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

generated-noncompressed-64k-idat/128x128.png
                        time:   [7.6085 µs 7.6092 µs 7.6099 µs]
                        thrpt:  [8.0205 GiB/s 8.0213 GiB/s 8.0220 GiB/s]
                 change:
                        time:   [+0.0696% +0.0869% +0.1040%] (p = 0.00 < 0.05)
                        thrpt:  [−0.1039% −0.0868% −0.0695%]
                        Change within noise threshold.

@197g
Copy link
Member Author

197g commented Sep 4, 2025

I feel uneasy about the fuzzer results. On the one hand, the kind of finding is just confirming the known bug from #605 and that's been around for a bit—we just now got it into the corpus at sufficiently shallow depth for CI to stumble onto it. Here's a file that even reproduces it on the main branch:
fuzzer-failure.zip (I think it originates from an oss-fuzz reproduction).

On the other hand it could mask some issues with this PR.

Copy link
Contributor

@anforowicz anforowicz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR. Just a few drive-by comments from somebody who is not a maintainer (so take with a grain of salt).

I feel uneasy about the fuzzer results.

I wonder how we could increase our confidence that the new code works correctly.

I note that the new test verifies that the first 10 rows decode successfully, but it does not verify the actual pixels of those 10 rows. Maybe we could add a test that chops off the input at a random location and verifies that if some X initial rows can be decoded, then the resulting pixels are the same as in the X initial rows of the untruncated/unchopped input.

))?;

// Unfilter this in a temporary buffer.
let mut row = row[1..rowlen].to_vec();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: IIUC to_vec will always allocate here. Would it be possible to avoid doing such an allocation for ~every row in a partially read input? Can we allocate once and then reuse the allocation?

let end_of_new_prev = start_of_new_prev + padding;

// Shift everything up as required.
let current_shift = end_of_new_prev - potential_end_of_new_prev;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if instead of reusing/shifting self.data_stream it may be a bit easier/clearer if a separate buffer (or two) would be introduced as a field of UnfilteringBuffer? Say copied_curr_row: Vec<u8> and copied_prev_row: Option<Vec<u8>> or something like that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and if we do that, then I guess the type of the current_start and prev_start fields would need to change from usize to something like Either<usize> (where Either::Left means the old behavior - data is in data_stream and where Either::Right means that the rows are in the new buffers).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type change in current_start was the main concern that drove me away from that concept. I wanted to keep influence on the main path as low as possible, having an Either would be an extra match / conditional read on every single path accessing current_start—even the happy one. I can benchmark it but intuitively, it will be quite visible.

@fintelia
Copy link
Contributor

I think there might be a complication with this approach: fdeflate actually uses the size of the output buffer / the value of output_index to know whether a back-reference is valid or should trigger DistanceTooFarBack. If an EOF is encountered and a few rows of data are prepended to the unfiltering buffer, then subsequent calls to decompress may wrongly think that a back-reference is valid when it isn't.

This only applies until 32768 bytes of uncompressed output have been produced, at which point all possible distances are valid.

@anforowicz
Copy link
Contributor

I think there might be a complication with this approach: fdeflate actually uses the size of the output buffer / the value of output_index to know whether a back-reference is valid or should trigger DistanceTooFarBack. If an EOF is encountered and a few rows of data are prepended to the unfiltering buffer, then subsequent calls to decompress may wrongly think that a back-reference is valid when it isn't.

This doesn't seem like a hard blocker - we just need to ensure that output: &mut [u8] passed to fdeflate::Decompressor::read doesn't include the prepended/unfiltered rows, right? In other words, I think we would just need to use &mut buffer[available..output_limit] rather than &mut buffer[..output_limit] in zlib.rs here:

.read(data, &mut buffer[..output_limit], filled, false)
.

Does that sound reasonable?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unexpected change of behavior when reading partial data

3 participants