Conversation

@reneme
Collaborator

@reneme reneme commented May 21, 2025

This aims at removing the std::vector dependency from Cipher_Mode::finish_msg() to avoid unnecessary copies or allocations. Another goal is to factor out the offset parameter handling from the concrete implementations into the Cipher_Mode base class. Note that this is not meant to be public-facing API just yet. I'm focussing on the private customization point in this PR.

The idea is that the new (probably eventually public) method Cipher_Mode::bytes_needed_for_finalization() provides the exact number of bytes needed to finalize a cipher mode. With that, the base class (or eventually the application?) allocates enough memory and passes it into finish_msg() as a span. finish_msg() then finalizes the cipher operation in-place and returns the number of bytes of the span that are actually relevant and part of the result. With this, the base class can, for instance, strip authentication tags from decryption results.
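
Roughly, the interplay I have in mind looks like this; a simplified sketch, not the exact code in this patch:

// Simplified sketch of the intended interplay; not the exact code in this patch.
void Cipher_Mode::finish(secure_vector<uint8_t>& final_block, size_t offset) {
   BOTAN_ARG_CHECK(final_block.size() >= offset, "offset is out of range");
   const size_t final_input_bytes = final_block.size() - offset;

   // Ask the concrete mode exactly how many bytes it needs to finalize and
   // make room for additional overhead (e.g. an AEAD tag) if necessary.
   const size_t final_buffer_bytes = bytes_needed_for_finalization(final_input_bytes);
   if(final_input_bytes < final_buffer_bytes) {
      final_block.resize(offset + final_buffer_bytes);
   }

   // The concrete mode finalizes in-place and reports how many bytes of the
   // span are actually part of the result, which lets the base class strip
   // an authentication tag from a decryption result, for instance.
   const size_t bytes_written = finish_msg(std::span(final_block).subspan(offset));
   final_block.resize(offset + bytes_written);
}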

Personally, I find this API functional enough for internal use but fairly involved for a public interface. I'm open to suggestions here. 🙂

@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch from 22840e5 to 1acfbd6 Compare May 22, 2025 06:20
@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch 2 times, most recently from b70452a to 848005a Compare May 30, 2025 14:50
reneme added a commit to reneme/botan that referenced this pull request Jun 1, 2025
With the current implementation of AEADs this doesn't matter. Though, given
that the implementations of AEAD::update() and ::finish() can change the
length of the in/out vector as a side-effect, this can result in unexpected
test failures that aren't actually the implementation's fault.

Found when refactoring the OCB mode in randombit#4880. When instantiated with a tag
length of 12, a call to ::finish() had already made room for the tag before
it failed (because of an expected state violation). Now, however, the garbage
vector's size has changed and no longer fits the update granularity further
down the line.
reneme added a commit to Rohde-Schwarz/botan that referenced this pull request Jun 1, 2025
@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch 3 times, most recently from c0b6471 to a93a6f1 Compare June 3, 2025 11:24
@coveralls

coveralls commented Jun 3, 2025

Coverage Status

coverage: 90.588% (+0.02%) from 90.572%
when pulling 79ae69b on Rohde-Schwarz:refactor/span-for-cipher-mode-finalization
into e3b3452 on randombit:master.

@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch from a93a6f1 to dd08532 Compare June 3, 2025 17:07
@reneme reneme marked this pull request as ready for review June 3, 2025 17:20
@reneme reneme requested a review from randombit June 3, 2025 17:20
@reneme
Collaborator Author

reneme commented Jun 3, 2025

This is quite a hefty change, because it requires significant adaptations in all Cipher_Mode finalization implementations. I think the modernization is worthwhile, though: it saves allocations and copies in several places, and in my view the use of span makes for easier-to-understand code. There's certainly more room for improvement in some implementations once process_msg also learns to deal with span.

Unfortunately, I don't see an obvious way to split this up into smaller patches. I'll do a self-review after walking away from it for a few days, though I'd really appreciate more eyes on this.

@reneme reneme requested review from FAlbertDev and atreiber94 June 3, 2025 17:27
@randombit

This comment was marked as resolved.

@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch from dd08532 to d717fa9 Compare June 4, 2025 06:04
@reneme

This comment was marked as resolved.

@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch from d717fa9 to 54efa82 Compare June 4, 2025 14:40
@reneme reneme removed the request for review from atreiber94 June 5, 2025 07:58
reneme

This comment was marked as resolved.

@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch from 54efa82 to 6c6f16a Compare June 10, 2025 12:27
@reneme
Collaborator Author

reneme commented Jun 12, 2025

Ready for review, from my side. @randombit

@randombit

This comment was marked as resolved.

@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch from 6c6f16a to ee82637 Compare June 18, 2025 06:35
@reneme

This comment was marked as resolved.

@randombit
Owner

Personally, I find this API functional enough for internal use but fairly involved for a public interface.

Yeah, the interface for cipher modes is really fiddly. It was designed trying to balance usability against minimizing copies, and I'm not sure it really ended up in the right place. But I also haven't come up with anything better so far. :/

@randombit
Owner

In general the change looks OK. One thing I'm concerned about is that it seems to add quite a bit of overhead, though to be honest I don't currently see where.

For example on master

CTR-BE(AES-128) encrypt buffer size 8 bytes: 685.081 MiB/sec 4.17 cycles/byte (342.56 MiB in 500.03 ms)

this branch

CTR-BE(AES-128) encrypt buffer size 8 bytes: 544.767 MiB/sec 5.24 cycles/byte (272.44 MiB in 500.10 ms)

In absolute terms the overhead here is probably quite small, but in cases where the message is small (so the finish cost dominates) and the cipher is fast (as with AES-NI) it's quite noticeable. I see it in other ciphers as well (e.g. a ~7% regression for AES-128-GCM with 8-byte messages).

@reneme
Collaborator Author

reneme commented Jun 27, 2025

In absolute terms the overhead here is probably quite small, but in cases where the message is small (so finish cost dominates) and the cipher is fast (as with AES-NI) it's quite noticeable.

Good point, measuring it like that. The result is surprising to me as well. I'll look into it, though it may have to wait a bit, unfortunately.

Edit: I couldn't wait and took a quick glance after all. The CTR mode becoming slower is especially unexpected: the finalization for stream ciphers is essentially a no-op. Also, in the Stream_Mode base class I moved from calling update() to process(), which actually removes an indirection (albeit one that was likely inlined anyway) and, with that, a call to std::vector::resize() (which would always have been a no-op as well).

The only thing I could immediately blame this on is the new virtual method call (bytes_needed_for_finalization()) in the generic implementation of finish(), which is useless for stream ciphers but can't be avoided in our class hierarchy.

A quick test on my M2 MacBook Air over breakfast coffee does not reproduce this finding, though:

➜  botan git:(master) ✗ ./botan speed --msec=2000 --buf-size=8 "CTR-BE(AES-128)"
CTR-BE(AES-128) encrypt buffer size 8 bytes: 551.325 MiB/sec (1102.688 MiB in 2000.069 ms)

➜  botan git:(refactor/span-for-cipher-mode-finalization) ✗ ./botan speed --msec=2000 --buf-size=8 "CTR-BE(AES-128)"
CTR-BE(AES-128) encrypt buffer size 8 bytes: 551.307 MiB/sec (1102.625 MiB in 2000.021 ms)

... multiple runs jitter between roughly 548 MiB/s and 560 MiB/s in both cases.

@randombit
Owner

It's very reliable for me on a Tiger Lake laptop. There is some noise, to be sure (~5% back and forth depending on the run), but this is more like 25%.

master (Clang 20):
CTR-BE(AES-128) encrypt buffer size 8 bytes: 687.218 MiB/sec 4.16 cycles/byte (343.62 MiB in 500.02 ms)
CTR-BE(AES-128) decrypt buffer size 8 bytes: 595.302 MiB/sec 4.80 cycles/byte (297.69 MiB in 500.06 ms)
CTR-BE(AES-128) encrypt buffer size 16 bytes: 1149.485 MiB/sec 2.48 cycles/byte (574.75 MiB in 500.01 ms)
CTR-BE(AES-128) decrypt buffer size 16 bytes: 1050.926 MiB/sec 2.72 cycles/byte (525.50 MiB in 500.04 ms)

here (Clang 20):
CTR-BE(AES-128) encrypt buffer size 8 bytes: 553.520 MiB/sec 5.16 cycles/byte (276.81 MiB in 500.10 ms)
CTR-BE(AES-128) decrypt buffer size 8 bytes: 479.829 MiB/sec 5.95 cycles/byte (239.94 MiB in 500.05 ms)
CTR-BE(AES-128) encrypt buffer size 16 bytes: 980.258 MiB/sec 2.91 cycles/byte (490.19 MiB in 500.06 ms)
CTR-BE(AES-128) decrypt buffer size 16 bytes: 863.294 MiB/sec 3.31 cycles/byte (431.69 MiB in 500.05 ms)

master (GCC 15):
CTR-BE(AES-128) encrypt buffer size 8 bytes: 843.907 MiB/sec 3.38 cycles/byte (422.00 MiB in 500.06 ms)
CTR-BE(AES-128) decrypt buffer size 8 bytes: 695.764 MiB/sec 4.11 cycles/byte (347.94 MiB in 500.08 ms)
CTR-BE(AES-128) encrypt buffer size 16 bytes: 1557.379 MiB/sec 1.83 cycles/byte (778.75 MiB in 500.04 ms)
CTR-BE(AES-128) decrypt buffer size 16 bytes: 1306.543 MiB/sec 2.19 cycles/byte (653.31 MiB in 500.03 ms)

here (GCC 15):
CTR-BE(AES-128) encrypt buffer size 8 bytes: 710.878 MiB/sec 4.02 cycles/byte (355.50 MiB in 500.09 ms)
CTR-BE(AES-128) decrypt buffer size 8 bytes: 597.441 MiB/sec 4.78 cycles/byte (298.75 MiB in 500.05 ms)
CTR-BE(AES-128) encrypt buffer size 16 bytes: 1341.603 MiB/sec 2.13 cycles/byte (670.81 MiB in 500.01 ms)
CTR-BE(AES-128) decrypt buffer size 16 bytes: 1138.683 MiB/sec 2.51 cycles/byte (569.38 MiB in 500.03 ms)

I'll check on another machine

@randombit
Owner

randombit commented Jun 28, 2025

Did some experimenting with disabling different parts of finish to see where the overhead comes from. The issue is definitely the virtual calls. Basically I'm seeing a ~25% perf drop (in this one specific small-message/fast-cipher case), and if I comment out

           if(final_input_bytes < minimum_final_size()) {
              throw Invalid_Argument("final input message is too small");
           }

I get ~12% back.

And if I replace

           const auto final_buffer_bytes = bytes_needed_for_finalization(final_input_bytes);

           // Make room for additional overhead to be produced during finalization
           if(final_input_bytes < final_buffer_bytes) {
              final_block.resize(offset + final_buffer_bytes);
           }

with

const size_t final_buffer_bytes = final_input_bytes;

(only valid for stream modes obviously)

I get back the other ~12%

So it seems the virtual calls are more costly on this laptop than on your machine. Or possibly the M2's AES hardware isn't fast enough for the overhead to become visible? IDK

So one immediate suggestion, which honestly seems like an improvement anyway, regardless of the overhead issue:

Each implementation of bytes_needed_for_finalization should itself check whether the input value is less than minimum_final_size and, if so, throw. Then two virtual calls become one (since bytes_needed_for_finalization lives in final leaf classes, the compiler can devirtualize the minimum_final_size call).
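
Something along these lines, in a hypothetical leaf class (just to illustrate where the check would move; other overrides omitted):

// Hypothetical leaf class, only to illustrate the devirtualization.
class Hypothetical_AEAD_Encryption final : public AEAD_Mode {
   private:
      size_t bytes_needed_for_finalization(size_t final_input_bytes) const override {
         // Because this class is final, the compiler can resolve
         // minimum_final_size() statically and inline it, so the generic
         // finish() ends up paying for a single virtual call only.
         if(final_input_bytes < minimum_final_size()) {
            throw Invalid_Argument("final input message is too small");
         }
         return final_input_bytes + tag_size();  // make room for the tag
      }

      // ... remaining overrides omitted ...
};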

randombit

This comment was marked as resolved.

@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch 3 times, most recently from 9a4b6e2 to a8d23fc Compare July 11, 2025 15:47
@reneme
Collaborator Author

reneme commented Jul 11, 2025

Each implementation of bytes_needed_for_finalization should itself check whether the input value is less than minimum_final_size and, if so, throw. Then two virtual calls become one (since bytes_needed_for_finalization lives in final leaf classes, the compiler can devirtualize the minimum_final_size call).

Very much a fair point!

I did what you suggested (in this commit: 79ae69b), and it turns out that most implementations of bytes_needed_for_finalization didn't even need to be adapted: either because minimum_final_size was zero anyway or because they already contained the check. The new function calculate_final_input_bytes (implemented in the *.cpp) now takes care of both the offset check (mentioned here) and a debug assert that acts as a safety net for the implementation-specific check in bytes_needed_for_finalization.
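
In spirit, the helper does something like this (a simplified sketch, not a verbatim copy of the patch):

// Simplified sketch of the new helper; not a verbatim copy of the patch.
size_t Cipher_Mode::calculate_final_input_bytes(size_t buffer_size, size_t offset) const {
   // The offset handling now lives in the base class, once and for all ...
   BOTAN_ARG_CHECK(buffer_size >= offset, "offset is out of range");
   const size_t final_input_bytes = buffer_size - offset;

   // ... and a debug assert backs up the minimum-size check that each
   // implementation of bytes_needed_for_finalization() is expected to do.
   BOTAN_DEBUG_ASSERT(final_input_bytes >= minimum_final_size());

   return final_input_bytes;
}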

Interestingly, the debug assert forced me to adapt a few negative tests that were meant to trigger an Invalid_State exception but ran into the Invalid_Argument instead. 🙄 I pulled these test adaptations into an independent pull request and rebased this PR after merging it.

As a general idea for consideration: the concrete implementations (of the modes of operation, but also of other primitives) are hidden in internal headers anyway. So, in theory, we should be able to refactor them using 'static polymorphism' behind the public base class interfaces (similar to pcurves). That way, many algorithm-specific traits (block size, tag length, intrinsic parallelism, ...) could become compile-time constants and thus cheap to access. One day, perhaps...
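
Hypothetically, something in this direction; all names below are made up, purely to sketch the idea:

// Purely illustrative: the public virtual Cipher_Mode interface would stay
// untouched, while the concrete (internal) modes are wrapped in a template
// so that algorithm traits become compile-time constants.
template <typename ConcreteMode>
class Mode_Adapter final : public Cipher_Mode {
   private:
      size_t bytes_needed_for_finalization(size_t final_input_bytes) const override {
         // ConcreteMode::tag_bytes would be a constexpr trait, so after
         // inlining this boils down to a single addition.
         return final_input_bytes + ConcreteMode::tag_bytes;
      }

      size_t finish_msg(std::span<uint8_t> final_block) override {
         // Non-virtual call into the concrete mode; trivially inlinable.
         return m_mode.finish(final_block);
      }

      // ... remaining Cipher_Mode overrides omitted ...

      ConcreteMode m_mode;
};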

@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch 3 times, most recently from 5a17678 to 3a965d6 Compare July 11, 2025 16:45
@reneme reneme requested a review from randombit July 11, 2025 16:55
@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch from 3a965d6 to 9b2e92a Compare July 12, 2025 13:35
@reneme
Collaborator Author

reneme commented Jul 12, 2025

Rebased after merging #4970. I'm happy to report that the stricter exception expectations did uncover another issue in commoncrypto that could have potentially caused actual harm!

Due to a missing "key is set" check in CommonCrypto_Cipher_Mode::bytes_needed_for_finalization, a call to finish() on an unkeyed object passed a nullptr handle to CCCryptorGetOutputLength. This resulted in an unreasonably large "expected output length" (probably some UB). The generic Cipher_Mode::finish() implementation then tried to allocate a huge vector, causing some STL exception to be thrown.

Before #4970 this STL exception was 'good enough'; now it is flagged as unexpected. 😎
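
To illustrate, a guard along these lines prevents it (a sketch, not necessarily the exact fix; the handle member name is illustrative):

// Sketch of the missing guard; the handle member name is illustrative.
size_t CommonCrypto_Cipher_Mode::bytes_needed_for_finalization(size_t final_input_bytes) const {
   // Without this check, an unkeyed object hands a nullptr cryptor handle
   // to CommonCrypto and gets back a bogus "expected output length".
   assert_key_material_set();
   return CCCryptorGetOutputLength(m_cipher, final_input_bytes, true);
}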

reneme added 2 commits July 12, 2025 17:54
Previously, the internal customization point finish_msg() took a
std::vector and re-allocated it to accommodate or discard encryption
overhead such as paddings and/or AEAD tags. This forced downstream users
into using a std::vector and potentially copying data, even when they
could pre-allocate enough memory out-of-band in their application.

Additionally, an offset was provided to deal with prefix bytes within
the vector that should not be processed. The handling of this offset was
duplicated in each concrete Cipher_Mode implementation. This is now
factored out into the base class.

The changes in this commit are not public-facing as they just affect the
internal customization point.
@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch from 9b2e92a to 79ae69b Compare July 12, 2025 15:54
@randombit
Owner

As a general idea for consideration: The concrete implementations (for the modes of operation but also other primitives) are anyway hidden in internal headers. So, in theory, we should be able to refactor them using 'static polymorphy' behind the public base class interfaces (similar to pcurves). That way, many algorithm-specific traits (block size, tag length, intrinsic parallelism, ...) could become compile-time constants and thus cheap to access.

Another (completely orthogonal) idea I had while thinking about this PR: Cipher_Mode was ill-designed, and it would have been much better to express encryption vs. decryption modes directly in the API, rather than covering both operations with the same API. Since only encryption ever expands, while decryption contracts, we could likely offer dedicated APIs that are simpler for each respective usage.

This is still fixable, since happily all of the cipher implementations are themselves private. And unlike the refactoring of an existing API done here, it can be done incrementally, one mode at a time. But it should probably happen as part of a larger rethink of how cipher operations are expressed (see mistakes.rst).

@reneme
Collaborator Author

reneme commented Jul 14, 2025

... it would have been much better to directly express encryption vs decryption modes in the API, rather than covering both operations with the same API. Since only encryption ever expands, while decryption contracts, likely we could offer dedicated APIs that are simpler for each respective usage.

Yes, full ack. I spent the morning trying to come up with a decent public API for a new Cipher_Mode::finish and got lost in the details. From what I saw, these are the complexities it would have to handle (essentially all at once):

  • Encryption
    • Addition of the tag (usually at the end, but sometimes at the front, e.g. SIV)
    • Addition of a padding, or ciphertext-stealing-style finalizations
    • Sometimes both, e.g. CBC+MAC or MAC+CBC in TLS 1.2
  • Decryption
    • Reading/verifying the tag, then removing it
    • Reading/verifying paddings or CTS, then maybe removing them
  • In general, usually mixed with the above
    • Buffered modes producing the entire ciphertext at the end, even when using process() or update(), along with padding/tag addition/removal

At this point, I would probably just expose an equivalent version of finish_msg() for users that want to avoid the allocations/copies (e.g. #4697) that may come with the existing std::vector-based version. It's not very ergonomic, but I doubt it's going to get much better anyway. And for most people, the existing API is likely just fine for now.
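
Concretely, I'd imagine something in this direction (a rough sketch; the method name and exact semantics are not settled):

// Rough sketch of a possible public, span-based finalization; the method
// name and exact semantics are not settled.
class Cipher_Mode /* ... */ {
   public:
      // Returns the buffer size needed to finalize a message whose final
      // chunk is final_input_bytes long.
      size_t bytes_needed_for_finalization(size_t final_input_bytes) const;

      // Callers pre-allocate bytes_needed_for_finalization(n) bytes, place
      // their final n input bytes at the front of the span, and get back
      // the number of bytes that are actually part of the result.
      size_t finish_in_place(std::span<uint8_t> final_block);
};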

@reneme
Collaborator Author

reneme commented Sep 4, 2025

@randombit from my side this is ready to go in. Or am I missing something?
