Refactor: Cipher_Mode::finish_msg() using std::span #4880
Conversation
With the current implementation of AEADs this doesn't matter. Though, given that the implementations of AEAD::update() and ::finish() can change the length of the in/out vector as a side-effect, this can result in unexpected test failures that aren't actually the implementation's fault. Found when refactoring the OCB mode in randombit#4880. When instantiated with a tag length of 12, a call to ::finish() had already made room for the tag before it failed (because of an expected state violation). Though, now the garbage vector's size has changed and doesn't fit the update granularity down the line.
This is quite a hefty change because of significant adaptations in all concrete Cipher_Mode implementations. Unfortunately, I don't see an obvious way to split this up into smaller patches. I'll do a self-review after walking away from it for a few days, though I'd really appreciate more eyes on this.
Ready for review from my side. @randombit
Yeah, the interface for cipher modes is really fiddly. It was designed trying to balance usability vs. minimizing copies, and I'm not sure it really ended up in the right place. But I also have not come up with anything better so far. :/
In general the change looks ok. One thing I'm concerned about is that this seems to be adding quite a bit of overhead - though tbh right now I do not see where. For example on master:

CTR-BE(AES-128) encrypt buffer size 8 bytes: 685.081 MiB/sec 4.17 cycles/byte (342.56 MiB in 500.03 ms)

and on this branch:

CTR-BE(AES-128) encrypt buffer size 8 bytes: 544.767 MiB/sec 5.24 cycles/byte (272.44 MiB in 500.10 ms)

In absolute terms the overhead here is probably quite small, but in cases where the message is small (so finish cost dominates) and the cipher is fast (as with AES-NI) it's quite noticeable. I see it also in other ciphers (e.g. a ~7% regression for AES-128-GCM with 8-byte messages).
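As a point of reference, a small-message measurement like the above could be approximated through the public API roughly as follows; this is only a sketch (key, nonce, iteration count, and the timing loop are my own choices, not the harness behind the numbers quoted above):

```cpp
#include <botan/cipher_mode.h>
#include <botan/secmem.h>

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
   // Encrypt many tiny (8 byte) messages so that the per-message
   // start()/finish() overhead dominates over the raw cipher throughput.
   auto enc = Botan::Cipher_Mode::create_or_throw("CTR-BE(AES-128)", Botan::Cipher_Dir::Encryption);

   const std::vector<uint8_t> key(enc->minimum_keylength(), 0xAB);
   enc->set_key(key);

   const std::vector<uint8_t> nonce(enc->default_nonce_length(), 0x42);
   const Botan::secure_vector<uint8_t> msg(8, 0x00);

   constexpr size_t iterations = 1'000'000;
   const auto t0 = std::chrono::steady_clock::now();
   for(size_t i = 0; i != iterations; ++i) {
      enc->start(nonce);
      Botan::secure_vector<uint8_t> buf = msg;  // fresh 8-byte message each round
      enc->finish(buf);                         // pays the finalization cost every time
   }
   const auto t1 = std::chrono::steady_clock::now();

   const double secs = std::chrono::duration<double>(t1 - t0).count();
   const double mib = (static_cast<double>(iterations) * 8.0) / (1024.0 * 1024.0);
   std::printf("%.1f MiB/sec\n", mib / secs);
   return 0;
}
```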
Good point to measure it like that. That's surprising to me as well; I'll look into it.

Edit: I couldn't wait to take a quick glance after all. Especially the CTR mode becoming slower is very unexpected: the finalization for stream ciphers is essentially a no-op. The only thing I could immediately blame this on is the new virtual method call. A quick test on my M2 MacBook Air over breakfast coffee does not reproduce this finding on my machine, though: multiple runs jitter between roughly 548 MiB/s and 560 MiB/s in both cases.
It's very reliable for me on a Tiger Lake laptop. There is some noise to be sure (~5% back and forth depending on the run), but this is more like a 25% regression, with both Clang 20 and GCC 15 (comparing master against this branch). I'll check on another machine.
Did some experimenting with disabling different parts of this; that gets me ~12% back. And if I replace it with something simpler (only valid for stream modes obviously), I get back the other ~12%.

So it seems the virtual calls are more costly on this laptop than on your machine. Or possibly the M2's AES hardware is not fast enough for the overhead to become visible? IDK.

So one immediate suggestion that honestly seems like an improvement anyway, regardless of the overhead issue: each implementation of finish_msg() …
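Presumably the suggestion is that each concrete mode can rely on the base class pre-sizing the buffer and only verify that in debug builds, instead of resizing anything itself. A hypothetical sketch of that shape (class and helper names are made up, and plain assert() stands in for whatever debug assertion the library actually uses):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <span>

// Hypothetical sketch, not code from this PR: a mode whose finalization adds
// no extra bytes asserts the pre-sized buffer instead of resizing it.
class My_Stream_Mode_Sketch {
   public:
      size_t minimum_final_size() const { return 0; }

      size_t finish_msg(std::span<uint8_t> buffer) {
         // The caller (base class) is expected to have sized `buffer` already;
         // in debug builds we merely check that instead of re-allocating.
         assert(buffer.size() >= minimum_final_size());
         transform_in_place(buffer);  // cipher-specific work on the trailing bytes
         return buffer.size();        // the whole span is part of the result
      }

   private:
      void transform_in_place(std::span<uint8_t>) { /* keystream XOR etc. */ }
};
```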
Very much a fair point! I did what you suggested (in this commit: 79ae69b), and it turns out that most implementations of finish_msg() … Interestingly, the debug-assert forced me to adapt a few negative tests that were meant to trigger an Invalid_State.

As a general idea for consideration: the concrete implementations (for the modes of operation, but also other primitives) are hidden in internal headers anyway. So, in theory, we should be able to refactor them using 'static polymorphism' behind the public base class interfaces (similar to pcurves). That way, many algorithm-specific traits (block size, tag length, intrinsic parallelism, ...) could become compile-time constants and thus cheap to access. One day, perhaps...
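To make the static-polymorphism idea a little more concrete, here is a purely hypothetical sketch of the direction (none of these types exist in the library; the point is only that traits like the tag length become compile-time constants):

```cpp
#include <cstddef>
#include <cstdint>
#include <span>

// Purely hypothetical sketch: the concrete mode is a template parameter of a
// thin wrapper, so algorithm traits resolve at compile time and only the
// wrapper would need to expose the (virtual) public interface.
template <typename ModeT>
class Mode_Wrapper_Sketch {
   public:
      size_t bytes_needed_for_finalization(size_t msg_len) const {
         return msg_len + ModeT::tag_length;  // constant folded, no virtual call
      }

      size_t finish_msg(std::span<uint8_t> buffer) { return m_mode.finish_msg(buffer); }

   private:
      ModeT m_mode;
};

struct Some_AEAD_Sketch {
      static constexpr size_t tag_length = 16;  // compile-time trait
      static constexpr size_t block_size = 16;  // compile-time trait

      size_t finish_msg(std::span<uint8_t> buffer) { return buffer.size(); }
};

using Some_AEAD = Mode_Wrapper_Sketch<Some_AEAD_Sketch>;
```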
Rebased after merging #4970. I'm happy to report that the stricter exception expectations did uncover another issue in commoncrypto that could have potentially caused actual harm! Due to a missing "key is set" check, an STL exception escaped instead of Botan's own. Before #4970 this STL exception was 'good enough'; now it was flagged as unexpected. 😎
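For illustration, the class of bug is roughly this: without an explicit guard, a missing key only surfaces later as some unrelated STL exception. A minimal, made-up sketch of the kind of check in question (not the actual commoncrypto fix):

```cpp
#include <botan/exceptn.h>

#include <cstdint>
#include <optional>
#include <vector>

// Made-up sketch of the kind of guard being discussed, not the actual fix:
// reject finalization with Botan's own exception if no key was ever set,
// instead of letting an unrelated STL exception escape further down.
class Keyed_Operation_Sketch {
   public:
      void set_key(std::vector<uint8_t> key) { m_key = std::move(key); }

      void finish() {
         if(!m_key.has_value()) {
            throw Botan::Invalid_State("Key was not set before finalization");
         }
         // ... proceed with the backend call using *m_key ...
      }

   private:
      std::optional<std::vector<uint8_t>> m_key;
};
```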
Previously, the internal customization point finish_msg() took a std::vector and re-allocated it to accommodate or discard encryption overhead such as padding and/or AEAD tags. This forced downstream users into using a std::vector and potentially copying data, even when they could pre-allocate enough memory out-of-band in their application. Additionally, an offset was provided to deal with prefix bytes within the vector that should not be processed. The handling of this offset was duplicated in each concrete Cipher_Mode implementation and is now factored out into the base class. The changes in this commit are not public-facing, as they only affect the internal customization point.
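In terms of signatures, the shape of the change described here is roughly the following sketch (parameter names and qualifiers are guesses, not the literal declarations from the patch):

```cpp
#include <botan/secmem.h>

#include <cstddef>
#include <cstdint>
#include <span>

// Before/after sketch of the internal customization point described above;
// the exact declarations in the patch may differ in details.
class Cipher_Mode_Customization_Sketch {
   protected:
      // Before: each concrete mode resized the vector itself and had to honour
      // an offset of prefix bytes that must not be processed.
      virtual void finish_msg_before(Botan::secure_vector<uint8_t>& final_block, size_t offset) = 0;

      // After: the base class pre-allocates the buffer (and deals with the
      // offset), the mode finalizes in place and returns how many bytes of the
      // span actually form the result.
      virtual size_t finish_msg_after(std::span<uint8_t> final_block) = 0;

   public:
      virtual ~Cipher_Mode_Customization_Sketch() = default;
};
```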
Another (completely orthogonal) idea I had while thinking about this PR: … This is still fixable, since happily all of the cipher implementations are themselves private. And at least, unlike a refactoring of an existing API like here, it can be done incrementally, one mode at a time. But it should probably happen as part of a larger rethink of how cipher operations are expressed (see mistakes.rst).
Yes, full ack. I spent the morning trying to come up with a decent public API for a new …

At this point, I would probably just expose an equivalent version of …
@randombit from my side this is ready to go in. Or am I missing something?
This aims at removing the std::vector dependency from Cipher_Mode::finish_msg() to avoid unnecessary copies or allocations. Another goal is to factor out the offset parameter handling from the concrete implementations into the Cipher_Mode base class. Note that this is not meant to be a public-facing API just yet; I'm focusing on the private customization point in this PR.

The idea is that the new (probably eventually public) method Cipher_Mode::bytes_needed_for_finalization() provides the exact number of bytes that it will need to finalize a cipher mode. With that, the base class (or eventually the application?) allocates enough memory and passes it into finish_msg() as a span. finish_msg() then finalizes the cipher operation in-place and returns the number of bytes of the span that are actually relevant and part of the result. With this, the base class can strip authentication tags from decryption results, for instance.

Personally, I find this API functional enough for internal use but fairly involved for a public interface. I'm open to suggestions here. 🙂
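To make the intended flow easier to picture, here is a simplified sketch of how the base class could tie the two methods together (my own paraphrase of the description above, not the literal code in this PR):

```cpp
#include <botan/secmem.h>

#include <cstddef>
#include <cstdint>
#include <span>

// Simplified sketch of the flow described above, not the literal PR code:
// the base class sizes the buffer once, the concrete mode finalizes in place
// and reports how many bytes of the span are part of the result.
class Cipher_Mode_Base_Sketch {
   public:
      void finish(Botan::secure_vector<uint8_t>& buffer, size_t offset = 0) {
         const size_t msg_len = buffer.size() - offset;

         // Ask the concrete mode exactly how much room finalization needs ...
         buffer.resize(offset + bytes_needed_for_finalization(msg_len));

         // ... hand over the message part as a span, skipping the prefix bytes ...
         const size_t written = finish_msg(std::span<uint8_t>(buffer).subspan(offset));

         // ... and strip whatever is not part of the result (e.g. an already
         // verified and removed authentication tag during decryption).
         buffer.resize(offset + written);
      }

      virtual ~Cipher_Mode_Base_Sketch() = default;

   protected:
      virtual size_t bytes_needed_for_finalization(size_t msg_len) const = 0;
      virtual size_t finish_msg(std::span<uint8_t> msg) = 0;
};
```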