Conversation

@reneme
Collaborator

@reneme reneme commented May 21, 2025

This aims at removing the std::vector dependency from Cipher_Mode::finish_msg() to avoid unnecessary copies or allocations. Another goal is to factor out the offset parameter handling from the concrete implementations into the Cipher_Mode base class. Note that this is not meant to be public-facing API just yet. I'm focussing on the private customization point in this PR.

The idea is that the new (probably eventually public) method Cipher_Mode::bytes_needed_for_finalization() provides the exact number of bytes needed to finalize a cipher mode. With that, the base class (or eventually the application?) allocates enough memory and passes it into finish_msg() as a span. finish_msg() then finalizes the cipher operation in-place and returns the number of bytes of the span that are actually relevant and part of the result. With this, the base class can, for instance, strip authentication tags from decryption results.
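
Roughly, the interplay I have in mind looks like this; a simplified sketch, not the exact code in this patch:

// Simplified sketch of the intended interplay; not the exact code in this patch.
void Cipher_Mode::finish(secure_vector<uint8_t>& final_block, size_t offset) {
   BOTAN_ARG_CHECK(final_block.size() >= offset, "offset is out of range");
   const size_t final_input_bytes = final_block.size() - offset;

   // Ask the concrete mode exactly how many bytes it needs to finalize and
   // make room for additional overhead (e.g. an AEAD tag) if necessary.
   const size_t final_buffer_bytes = bytes_needed_for_finalization(final_input_bytes);
   if(final_input_bytes < final_buffer_bytes) {
      final_block.resize(offset + final_buffer_bytes);
   }

   // The concrete mode finalizes in-place and reports how many bytes of the
   // span are actually part of the result, which lets the base class strip
   // an authentication tag from a decryption result, for instance.
   const size_t bytes_written = finish_msg(std::span(final_block).subspan(offset));
   final_block.resize(offset + bytes_written);
}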

Personally, I find this API functional enough for internal use but fairly involved for a public interface. I'm open to suggestions here. 🙂

@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch from 22840e5 to 1acfbd6 Compare May 22, 2025 06:20
@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch 2 times, most recently from b70452a to 848005a Compare May 30, 2025 14:50
reneme added a commit to reneme/botan that referenced this pull request Jun 1, 2025
With the current implementation of AEADs this doesn't matter. Though, given
that the implementations of AEAD::update() and ::finish() can change the
length of the in/out vector as a side-effect, this can result in unexpected
test failures that aren't actually the implementation's fault.

Found when refactoring the OCB mode in randombit#4880. When instantiated with a tag
length of 12, a call to ::finish() had already made room for the tag before
it failed (because of an expected state violation). Now, however, the garbage
vector's size has changed and no longer fits the update granularity further
down the line.
reneme added a commit to Rohde-Schwarz/botan that referenced this pull request Jun 1, 2025
@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch 3 times, most recently from c0b6471 to a93a6f1 Compare June 3, 2025 11:24
@coveralls

coveralls commented Jun 3, 2025

Coverage Status

coverage: 90.588% (+0.02%) from 90.572%
when pulling 79ae69b on Rohde-Schwarz:refactor/span-for-cipher-mode-finalization
into e3b3452 on randombit:master.

@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch from a93a6f1 to dd08532 Compare June 3, 2025 17:07
@reneme reneme marked this pull request as ready for review June 3, 2025 17:20
@reneme reneme requested a review from randombit June 3, 2025 17:20
@reneme
Collaborator Author

reneme commented Jun 3, 2025

This is quite a hefty change, because it requires significant adaptations in all Cipher_Mode finalization implementations. I think the modernization is worthwhile, though: it saves allocations and copies in several places, and in my view the use of span makes for easier-to-understand code. There's certainly more room for improvement in some implementations once process_msg also learns to deal with span.

Unfortunately, I don't see an obvious way to split this up into smaller patches. I'll do a self-review after walking away from it for a few days, though I'd really appreciate more eyes on this.

@reneme reneme requested review from FAlbertDev and atreiber94 June 3, 2025 17:27
@randombit

This comment was marked as resolved.

@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch from dd08532 to d717fa9 Compare June 4, 2025 06:04
@reneme

This comment was marked as resolved.

@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch from d717fa9 to 54efa82 Compare June 4, 2025 14:40
@reneme reneme removed the request for review from atreiber94 June 5, 2025 07:58
reneme

This comment was marked as resolved.

@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch from 54efa82 to 6c6f16a Compare June 10, 2025 12:27
@reneme
Collaborator Author

reneme commented Jun 12, 2025

Ready for review, from my side. @randombit

@randombit

This comment was marked as resolved.

@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch from 6c6f16a to ee82637 Compare June 18, 2025 06:35
@reneme

This comment was marked as resolved.

@randombit
Owner

Personally, I find this API functional enough for internal use but fairly involved for a public interface.

Yeah, the interface for cipher modes is really fiddly. It was designed trying to balance usability against minimizing copies, and I'm not sure it really ended up in the right place. But I also haven't come up with anything better so far. :/

@randombit
Owner

In general the change looks OK. One thing I'm concerned about is that it seems to add quite a bit of overhead, though to be honest I don't currently see where.

For example on master

CTR-BE(AES-128) encrypt buffer size 8 bytes: 685.081 MiB/sec 4.17 cycles/byte (342.56 MiB in 500.03 ms)

this branch

CTR-BE(AES-128) encrypt buffer size 8 bytes: 544.767 MiB/sec 5.24 cycles/byte (272.44 MiB in 500.10 ms)

In absolute terms the overhead here is probably quite small, but in cases where the message is small (so the finish cost dominates) and the cipher is fast (as with AES-NI) it's quite noticeable. I see it in other ciphers as well (e.g. a ~7% regression for AES-128-GCM with 8-byte messages).

@reneme
Collaborator Author

reneme commented Jun 27, 2025

In absolute terms the overhead here is probably quite small, but in cases where the message is small (so finish cost dominates) and the cipher is fast (as with AES-NI) it's quite noticeable.

Good point, measuring it like that. The result is surprising to me as well. I'll look into it, though it may have to wait a bit, unfortunately.

Edit: I couldn't wait and took a quick glance after all. The CTR mode becoming slower is especially unexpected: the finalization for stream ciphers is essentially a no-op. Also, in the Stream_Mode base class I moved from calling update() to process(), which actually removes an indirection (albeit one that was likely inlined anyway) and, with that, a call to std::vector::resize() (which would always have been a no-op as well).

The only thing I could immediately blame this on is the new virtual method call (bytes_needed_for_finalization()) in the generic implementation of finish(), which is useless for stream ciphers but can't be avoided in our class hierarchy.

A quick test on my M2 MacBook Air over breakfast coffee does not reproduce this finding, though:

➜  botan git:(master) ✗ ./botan speed --msec=2000 --buf-size=8 "CTR-BE(AES-128)"
CTR-BE(AES-128) encrypt buffer size 8 bytes: 551.325 MiB/sec (1102.688 MiB in 2000.069 ms)

➜  botan git:(refactor/span-for-cipher-mode-finalization) ✗ ./botan speed --msec=2000 --buf-size=8 "CTR-BE(AES-128)"
CTR-BE(AES-128) encrypt buffer size 8 bytes: 551.307 MiB/sec (1102.625 MiB in 2000.021 ms)

... multiple runs jitter between roughly 548 MiB/s and 560 MiB/s in both cases.

@randombit
Owner

It's very reliable for me on a Tiger Lake laptop. There is some noise, to be sure (~5% back and forth depending on the run), but this is more like 25%.

master (Clang 20):
CTR-BE(AES-128) encrypt buffer size 8 bytes: 687.218 MiB/sec 4.16 cycles/byte (343.62 MiB in 500.02 ms)
CTR-BE(AES-128) decrypt buffer size 8 bytes: 595.302 MiB/sec 4.80 cycles/byte (297.69 MiB in 500.06 ms)
CTR-BE(AES-128) encrypt buffer size 16 bytes: 1149.485 MiB/sec 2.48 cycles/byte (574.75 MiB in 500.01 ms)
CTR-BE(AES-128) decrypt buffer size 16 bytes: 1050.926 MiB/sec 2.72 cycles/byte (525.50 MiB in 500.04 ms)

here (Clang 20):
CTR-BE(AES-128) encrypt buffer size 8 bytes: 553.520 MiB/sec 5.16 cycles/byte (276.81 MiB in 500.10 ms)
CTR-BE(AES-128) decrypt buffer size 8 bytes: 479.829 MiB/sec 5.95 cycles/byte (239.94 MiB in 500.05 ms)
CTR-BE(AES-128) encrypt buffer size 16 bytes: 980.258 MiB/sec 2.91 cycles/byte (490.19 MiB in 500.06 ms)
CTR-BE(AES-128) decrypt buffer size 16 bytes: 863.294 MiB/sec 3.31 cycles/byte (431.69 MiB in 500.05 ms)

master (GCC 15):
CTR-BE(AES-128) encrypt buffer size 8 bytes: 843.907 MiB/sec 3.38 cycles/byte (422.00 MiB in 500.06 ms)
CTR-BE(AES-128) decrypt buffer size 8 bytes: 695.764 MiB/sec 4.11 cycles/byte (347.94 MiB in 500.08 ms)
CTR-BE(AES-128) encrypt buffer size 16 bytes: 1557.379 MiB/sec 1.83 cycles/byte (778.75 MiB in 500.04 ms)
CTR-BE(AES-128) decrypt buffer size 16 bytes: 1306.543 MiB/sec 2.19 cycles/byte (653.31 MiB in 500.03 ms)

here (GCC 15):
CTR-BE(AES-128) encrypt buffer size 8 bytes: 710.878 MiB/sec 4.02 cycles/byte (355.50 MiB in 500.09 ms)
CTR-BE(AES-128) decrypt buffer size 8 bytes: 597.441 MiB/sec 4.78 cycles/byte (298.75 MiB in 500.05 ms)
CTR-BE(AES-128) encrypt buffer size 16 bytes: 1341.603 MiB/sec 2.13 cycles/byte (670.81 MiB in 500.01 ms)
CTR-BE(AES-128) decrypt buffer size 16 bytes: 1138.683 MiB/sec 2.51 cycles/byte (569.38 MiB in 500.03 ms)

I'll check on another machine

@randombit
Owner

randombit commented Jun 28, 2025

Did some experimenting with disabling different parts of finish to see where the overhead comes from. The issue is definitely the virtual calls. Basically I'm seeing a ~25% perf drop (in this one specific small-message/fast-cipher case), and if I comment out

           if(final_input_bytes < minimum_final_size()) {
              throw Invalid_Argument("final input message is too small");
           }

I get ~12% back.

And if I replace

           const auto final_buffer_bytes = bytes_needed_for_finalization(final_input_bytes);

           // Make room for additional overhead to be produced during finalization
           if(final_input_bytes < final_buffer_bytes) {
              final_block.resize(offset + final_buffer_bytes);
           }

with

const size_t final_buffer_bytes = final_input_bytes;

(only valid for stream modes obviously)

I get back the other ~12%

So it seems the virtual calls are more costly on this laptop than on your machine. Or possibly the M2's AES hardware isn't fast enough for the overhead to become visible? IDK

So one immediate suggestion, which honestly seems like an improvement anyway, regardless of the overhead issue:

Each implementation of bytes_needed_for_finalization should itself check whether the input value is less than minimum_final_size and, if so, throw. Then two virtual calls become one (since bytes_needed_for_finalization lives in final leaf classes, the compiler can devirtualize the minimum_final_size call).
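
Something along these lines, in a hypothetical leaf class (just to illustrate where the check would move; other overrides omitted):

// Hypothetical leaf class, only to illustrate the devirtualization.
class Hypothetical_AEAD_Encryption final : public AEAD_Mode {
   private:
      size_t bytes_needed_for_finalization(size_t final_input_bytes) const override {
         // Because this class is final, the compiler can resolve
         // minimum_final_size() statically and inline it, so the generic
         // finish() ends up paying for a single virtual call only.
         if(final_input_bytes < minimum_final_size()) {
            throw Invalid_Argument("final input message is too small");
         }
         return final_input_bytes + tag_size();  // make room for the tag
      }

      // ... remaining overrides omitted ...
};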

randombit

This comment was marked as resolved.

@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch 3 times, most recently from 9a4b6e2 to a8d23fc Compare July 11, 2025 15:47
@reneme
Collaborator Author

reneme commented Jul 11, 2025

Each implementation of bytes_needed_for_finalization should itself check whether the input value is less than minimum_final_size and, if so, throw. Then two virtual calls become one (since bytes_needed_for_finalization lives in final leaf classes, the compiler can devirtualize the minimum_final_size call).

Very much a fair point!

I did what you suggested (in this commit: 79ae69b), and it turns out that most implementations of bytes_needed_for_finalization didn't even need to be adapted: either because minimum_final_size was zero anyway or because they already contained the check. The new function calculate_final_input_bytes (implemented in the *.cpp) now takes care of both the offset check (mentioned here) and a debug assert that acts as a safety net for the implementation-specific check in bytes_needed_for_finalization.
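
In spirit, the helper does something like this (a simplified sketch, not a verbatim copy of the patch):

// Simplified sketch of the new helper; not a verbatim copy of the patch.
size_t Cipher_Mode::calculate_final_input_bytes(size_t buffer_size, size_t offset) const {
   // The offset handling now lives in the base class, once and for all ...
   BOTAN_ARG_CHECK(buffer_size >= offset, "offset is out of range");
   const size_t final_input_bytes = buffer_size - offset;

   // ... and a debug assert backs up the minimum-size check that each
   // implementation of bytes_needed_for_finalization() is expected to do.
   BOTAN_DEBUG_ASSERT(final_input_bytes >= minimum_final_size());

   return final_input_bytes;
}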

Interestingly, the debug assert forced me to adapt a few negative tests that were meant to trigger an Invalid_State exception but ran into the Invalid_Argument instead. 🙄 I pulled these test adaptations into an independent pull request and rebased this PR after merging it.

As a general idea for consideration: the concrete implementations (of the modes of operation, but also of other primitives) are hidden in internal headers anyway. So, in theory, we should be able to refactor them using 'static polymorphism' behind the public base class interfaces (similar to pcurves). That way, many algorithm-specific traits (block size, tag length, intrinsic parallelism, ...) could become compile-time constants and thus cheap to access. One day, perhaps...
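
Hypothetically, something in this direction; all names below are made up, purely to sketch the idea:

// Purely illustrative: the public virtual Cipher_Mode interface would stay
// untouched, while the concrete (internal) modes are wrapped in a template
// so that algorithm traits become compile-time constants.
template <typename ConcreteMode>
class Mode_Adapter final : public Cipher_Mode {
   private:
      size_t bytes_needed_for_finalization(size_t final_input_bytes) const override {
         // ConcreteMode::tag_bytes would be a constexpr trait, so after
         // inlining this boils down to a single addition.
         return final_input_bytes + ConcreteMode::tag_bytes;
      }

      size_t finish_msg(std::span<uint8_t> final_block) override {
         // Non-virtual call into the concrete mode; trivially inlinable.
         return m_mode.finish(final_block);
      }

      // ... remaining Cipher_Mode overrides omitted ...

      ConcreteMode m_mode;
};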

@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch 3 times, most recently from 5a17678 to 3a965d6 Compare July 11, 2025 16:45
@reneme reneme requested a review from randombit July 11, 2025 16:55
@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch from 3a965d6 to 9b2e92a Compare July 12, 2025 13:35
@reneme
Collaborator Author

reneme commented Jul 12, 2025

Rebased after merging #4970. I'm happy to report that the stricter exception expectations did uncover another issue in commoncrypto that could have potentially caused actual harm!

Due to a missing "key is set" check in CommonCrypto_Cipher_Mode::bytes_needed_for_finalization, a call to finish() on an unkeyed object passed a nullptr handle to CCCryptorGetOutputLength. This resulted in an unreasonably large "expected output length" (probably some UB). The generic Cipher_Mode::finish() implementation then tried to allocate a huge vector, causing some STL exception to be thrown.

Before #4970 this STL exception was 'good enough'; now it is flagged as unexpected. 😎
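
To illustrate, a guard along these lines prevents it (a sketch, not necessarily the exact fix; the handle member name is illustrative):

// Sketch of the missing guard; the handle member name is illustrative.
size_t CommonCrypto_Cipher_Mode::bytes_needed_for_finalization(size_t final_input_bytes) const {
   // Without this check, an unkeyed object hands a nullptr cryptor handle
   // to CommonCrypto and gets back a bogus "expected output length".
   assert_key_material_set();
   return CCCryptorGetOutputLength(m_cipher, final_input_bytes, true);
}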

reneme added 2 commits July 12, 2025 17:54
Previously, the internal customization point finish_msg() took a
std::vector and re-allocated it to accommodate or discard encryption
overhead such as paddings and/or AEAD tags. This forced downstream users
into using a std::vector and potentially copying data, even when they
could pre-allocate enough memory out-of-band in their application.

Additionally, an offset was provided to deal with prefix bytes within
the vector that should not be processed. The handling of this offset was
duplicated in each concrete Cipher_Mode implementation. This is now
factored out into the base class.

The changes in this commit are not public-facing as they just affect the
internal customization point.
@reneme reneme force-pushed the refactor/span-for-cipher-mode-finalization branch from 9b2e92a to 79ae69b Compare July 12, 2025 15:54
@randombit
Owner

As a general idea for consideration: The concrete implementations (for the modes of operation but also other primitives) are anyway hidden in internal headers. So, in theory, we should be able to refactor them using 'static polymorphy' behind the public base class interfaces (similar to pcurves). That way, many algorithm-specific traits (block size, tag length, intrinsic parallelism, ...) could become compile-time constants and thus cheap to access.

Another (completely orthogonal) idea I had while thinking about this PR: Cipher_Mode was ill-designed, and it would have been much better to express encryption vs. decryption modes directly in the API, rather than covering both operations with the same API. Since only encryption ever expands, while decryption contracts, we could likely offer dedicated APIs that are simpler for each respective usage.

This is still fixable, since happily all of the cipher implementations are themselves private. And unlike the refactoring of an existing API done here, it can be done incrementally, one mode at a time. But it should probably happen as part of a larger rethink of how cipher operations are expressed (see mistakes.rst).

@reneme
Collaborator Author

reneme commented Jul 14, 2025

... it would have been much better to directly express encryption vs decryption modes in the API, rather than covering both operations with the same API. Since only encryption ever expands, while decryption contracts, likely we could offer dedicated APIs that are simpler for each respective usage.

Yes, full ack. I spent the morning trying to come up with a decent public API for a new Cipher_Mode::finish and got lost in the details. From what I saw, these are the complexities it would have to handle (essentially all at once):

  • Encryption
    • Addition of the tag (usually at the end, but sometimes at the front, e.g. SIV)
    • Addition of a padding, or ciphertext-stealing-style finalizations
    • Sometimes both, e.g. CBC+MAC or MAC+CBC in TLS 1.2
  • Decryption
    • Reading/verifying the tag, then removing it
    • Reading/verifying paddings or CTS, then maybe removing them
  • In general, usually mixed with the above
    • Buffered modes producing the entire ciphertext at the end, even when using process() or update(), along with padding/tag addition/removal

At this point, I would probably just expose an equivalent version of finish_msg() for users that want to avoid the allocations/copies (e.g. #4697) that may come with the existing std::vector-based version. It's not very ergonomic, but I doubt it's going to get much better anyway. And for most people, the existing API is likely just fine for now.
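
Concretely, I'd imagine something in this direction (a rough sketch; the method name and exact semantics are not settled):

// Rough sketch of a possible public, span-based finalization; the method
// name and exact semantics are not settled.
class Cipher_Mode /* ... */ {
   public:
      // Returns the buffer size needed to finalize a message whose final
      // chunk is final_input_bytes long.
      size_t bytes_needed_for_finalization(size_t final_input_bytes) const;

      // Callers pre-allocate bytes_needed_for_finalization(n) bytes, place
      // their final n input bytes at the front of the span, and get back
      // the number of bytes that are actually part of the result.
      size_t finish_in_place(std::span<uint8_t> final_block);
};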

@reneme
Collaborator Author

reneme commented Sep 4, 2025

@randombit from my side this is ready to go in. Or am I missing something?
