OHTTP keys should be rotated#1449

Open
zealsham wants to merge 1 commit into payjoin:master from zealsham:key-rotation

Conversation

@zealsham (Collaborator) commented Mar 29, 2026

This PR addresses #445. It implements OHTTP key rotation for payjoin-mailroom. Mailroom operators can now configure the interval at which keys are rotated. If a key has expired, a 422 error is returned to clients. Clients can handle key rotation via the Cache-Control header returned by the directory.

Pull Request Checklist

Please confirm the following before requesting review:

@coveralls (Collaborator) commented Mar 29, 2026

Pull Request Test Coverage Report for Build 23958918411

Details

  • 171 of 258 (66.28%) changed or added relevant lines in 5 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.4%) to 83.932%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
payjoin-mailroom/src/key_config.rs | 77 | 87 | 88.51%
payjoin-mailroom/src/lib.rs | 19 | 30 | 63.33%
payjoin-mailroom/src/config.rs | 2 | 14 | 14.29%
payjoin-mailroom/src/directory.rs | 71 | 125 | 56.8%

Files with Coverage Reduction | New Missed Lines | %
payjoin-mailroom/src/key_config.rs | 1 | 90.27%

Totals:
Change from base Build 23927050434: -0.4%
Covered Lines: 10928
Relevant Lines: 13020

💛 - Coveralls

@nothingmuch (Contributor) left a comment:

max-age handling as implemented will not be correct without an Age (or possibly Date) header. I also have a bunch of suggestions to simplify; some concern the code itself, others the approach.

listener: "[::]:8080".parse().expect("valid default listener address"),
storage_dir: PathBuf::from("./data"),
timeout: Duration::from_secs(30),
ohttp_keys_max_age: Some(Duration::from_secs(604800)),
Contributor:

Suggested change
ohttp_keys_max_age: Some(Duration::from_secs(604800)),
ohttp_keys_max_age: Some(Duration::from_secs(7 * 24 * 60 * 60)),

easier to read IMO. Duration::from_weeks also exists but only in nightly

{
// duration value of 0 isn't accepted
let secs: Option<u64> = Option::deserialize(deserializer)?;
Ok(secs.filter(|&s| s > 0).map(Duration::from_secs))
Contributor:

if an invalid duration is given, this will silently deserialize as Ok(None) instead of returning an error
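A sketch of the stricter behavior, using a plain validation function in place of the serde deserializer (the function name and error text are illustrative, not from the PR):

```rust
use std::time::Duration;

// Hypothetical sketch: turn a zero duration into an error instead of
// silently mapping it to None, as the review comment suggests.
fn parse_max_age(secs: Option<u64>) -> Result<Option<Duration>, String> {
    match secs {
        Some(0) => Err("ohttp_keys_max_age must be greater than 0".to_string()),
        Some(s) => Ok(Some(Duration::from_secs(s))),
        None => Ok(None),
    }
}

fn main() {
    // zero now surfaces as a configuration error
    assert!(parse_max_age(Some(0)).is_err());
    assert_eq!(parse_max_age(Some(60)).unwrap(), Some(Duration::from_secs(60)));
    // absent value remains valid and means "no rotation"
    assert_eq!(parse_max_age(None).unwrap(), None);
}
```

In a serde context the same match would live inside the custom deserializer, returning `serde::de::Error::custom(...)` in the zero case.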

#[derive(Debug)]
pub struct OhttpKeySet {
pub current_key_id: u8,
pub servers: BTreeMap<u8, ohttp::Server>,
Contributor:

why is this pub, and named servers? That seems like a reflection of an implementation detail.

also, BTreeMap seems overkill if this is constrained to only IDs 1 and 2; an [Option<ohttp::Server>; 2] could be used, with the key ID deriving an index by subtracting 1. seems simpler?

Collaborator (author):

True! My initial implementation wasn't deleting old keys automatically and would load them back into memory on restart, but only accept the one with the largest id number. I'll change this.

/// All keys in the `servers` map are accepted for decapsulation, allowing clients
/// with cached older keys to continue working during the overlap window.
#[derive(Debug)]
pub struct OhttpKeySet {
Contributor:

the name of this struct seems a bit at odds with its functionality, which also handles OHTTP decapsulation; maybe KeyRotatingServer is more appropriate?

if let Some(max_age) = self.ohttp_keys_max_age {
res.headers_mut().insert(
CACHE_CONTROL,
HeaderValue::from_str(&format!("public, max-age={}", max_age.as_secs()))
Contributor:

this will specify the max age of the HTTP resource relative to the time at which the response was generated, but the semantics of key expiration and self.ohttp_keys_max_age are relative to when the key was initialized

the Age header can be used to specify the age of the response, which clients subtract from max-age, or this adjustment could be done server side

s-maxage is arguably the more appropriate cache-control directive to use since this response can be shared (cf. the expired key consistency draft), or the equivalent public, max-age=... can be used. immutable can also be set.
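A sketch of the server-side adjustment described above, assuming the key set tracks when the current key became valid (function and variable names are illustrative):

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch: compute max-age relative to when the key was
// initialized, not when the response was generated, by subtracting the
// key's elapsed age from the configured maximum.
fn remaining_max_age(key_valid_from: Instant, key_max_age: Duration) -> u64 {
    key_max_age
        .saturating_sub(key_valid_from.elapsed()) // clamps to zero if already expired
        .as_secs()
}

fn main() {
    // pretend the key was initialized 100 seconds ago with a 1-hour max age
    let valid_from = Instant::now() - Duration::from_secs(100);
    let remaining = remaining_max_age(valid_from, Duration::from_secs(3600));
    assert!(remaining <= 3500 && remaining > 3490);
    // the handler would then emit e.g.
    // format!("public, max-age={remaining}, immutable")
}
```

With this adjustment, a client that caches the response for `max-age` seconds expires it at roughly the same moment the server retires the key, regardless of when the response was fetched.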

Collaborator (author):

This was my first time seeing the "immutable" cache header.

Implementing this.

/// A new key with the next key_id is generated and persisted to disk.
/// The new key becomes the current key served to clients.
/// After `max_age` elapses, the old key is dropped from memory and its `.ikm` file is deleted from disk.
pub fn spawn_key_rotation(
Contributor:

perhaps it'd be simpler to do the keygen logic in the request handling path instead of as a background task?

another way to potentially simplify this is to use a KDF to generate keys as a function of time, i.e. divide the current time by the max key age, use that as a nonce/counter in a KDF based on a single persisted master key, and just generate the per-epoch key deterministically, that would remove the need for persisting the key material on disk (and so avoid any consistency issues and the need for a sync() call while the rwlock is held to ensure that key material is in durable storage)

Collaborator (author):

another way to potentially simplify this is to use a KDF to generate keys as a function of time, i.e. divide the current time by the max key age, use that as a nonce/counter in a KDF based on a single persisted master key, and just generate the per-epoch key deterministically, that would remove the need for persisting the key material on disk (and so avoid any consistency issues and the need for a sync() call while the rwlock is held to ensure that key material is in durable storage)

not sure i completely understand what you mean here. could this be elaborated, or discussed in the upcoming mob programming session if you have the time?

Contributor:

i think my suggestion was a mistake, see below for reasoning. at any rate to answer the question:

simplest version:

generate a root secret, save it on disk

current epoch number = current unix time / epoch length

epoch key = KDF(root secret, current epoch number)

so at every point in time there's a deterministically derived epoch key
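For illustration only, the simple version might be sketched like this; std's DefaultHasher stands in for a real KDF (e.g. HKDF) and is not cryptographically secure:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// One week, matching the default max age discussed in this PR.
const EPOCH_LEN_SECS: u64 = 7 * 24 * 60 * 60;

// Epoch number = current unix time divided by the epoch length.
fn epoch_number(unix_time: u64) -> u64 {
    unix_time / EPOCH_LEN_SECS
}

// Epoch key = KDF(root secret, epoch number). DefaultHasher is only a
// stand-in here; real code would use an actual KDF and output key material.
fn epoch_key(root_secret: &[u8], epoch: u64) -> u64 {
    let mut h = DefaultHasher::new();
    root_secret.hash(&mut h);
    epoch.hash(&mut h);
    h.finish()
}

fn main() {
    let root = b"persisted-root-secret"; // illustrative root secret
    // two timestamps in the same epoch derive the same key...
    assert_eq!(epoch_number(1_700_000_000), epoch_number(1_700_000_100));
    assert_eq!(
        epoch_key(root, epoch_number(1_700_000_000)),
        epoch_key(root, epoch_number(1_700_000_100)),
    );
    // ...while the next epoch derives a different one
    assert_ne!(
        epoch_key(root, epoch_number(1_700_000_000)),
        epoch_key(root, epoch_number(1_700_000_000 + EPOCH_LEN_SECS)),
    );
}
```

Only the root secret would need persisting; every epoch key is recomputable on demand.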

ratcheting version:

generate initial epoch key, save it on disk

epoch key = H(previous epoch key), overwrite previous saved epoch key

but thinking about this more, assuming a temporary compromise (same as in forward secrecy) where a key from an earlier epoch was leaked, and after that the adversary loses access and then a new randomly generated key is introduced, then secrecy will be recovered. this is no longer the case with a deterministic approach.

it could be argued that this model is too optimistic for real-world settings. once key material is exfiltrated, supposing the compromise is detected and a new host is provisioned... what is gained by copying the key to the new instance? i think it's better to revoke it entirely (i.e. respond with 422 and force re-configuration).

on the other hand, if the compromise isn't detected so no new host is provisioned, why would a real-world attacker lose the ability to compromise the host again if they had enough access to exfiltrate key material at an earlier time? that seems like a pretty contrived scenario. it could be realized using scheduled reboots and an ephemeral root filesystem, combined to ensure that persistent access can't be set up, with regular updates that might patch the vulnerability recovering from the compromise... anyway, the point is that actually realizing this threat model takes significant effort.

so, if this reason is not strong enough to want to do this, what other reasons are there to rotate keys? i can't think of one... and if it is strong enough, we definitely need the stateful version you've implemented, not a deterministic approach since that seems to invalidate the only justification

@zealsham zealsham force-pushed the key-rotation branch 2 times, most recently from 71abcf7 to aa33a74 Compare April 2, 2026 16:59
.and_then(|i| self.keys[i as usize].as_ref());
match server {
Some(s) => s.decapsulate(ohttp_body),
None => Err(ohttp::Error::Truncated),
Contributor:

Suggested change
None => Err(ohttp::Error::Truncated),
None => Err(ohttp::Error::KeyId),

why truncated?

if self.ohttp.read().await.current_key_created_at().elapsed() < max_age {
return Ok(());
}
let mut keyset = self.ohttp.write().await;
Contributor:

both keyset and self.ohttp are hard to read... i would consider renaming, maybe self.servers_by_keyid?

Comment on lines +488 to +497
let new_key_id = keyset.next_key_id();
if let Some(dir) = &self.ohttp_keys_dir {
let _ = tokio::fs::remove_file(dir.join(format!("{new_key_id}.ikm"))).await;
}
let config = crate::key_config::gen_ohttp_server_config_with_id(new_key_id)
.map_err(HandlerError::InternalServerError)?;
if let Some(dir) = &self.ohttp_keys_dir {
crate::key_config::persist_key_config(&config, dir)
.map_err(HandlerError::InternalServerError)?;
}
Contributor:

seems cleaner to try generating before zapping the old key

Suggested change
let new_key_id = keyset.next_key_id();
if let Some(dir) = &self.ohttp_keys_dir {
let _ = tokio::fs::remove_file(dir.join(format!("{new_key_id}.ikm"))).await;
}
let config = crate::key_config::gen_ohttp_server_config_with_id(new_key_id)
.map_err(HandlerError::InternalServerError)?;
if let Some(dir) = &self.ohttp_keys_dir {
crate::key_config::persist_key_config(&config, dir)
.map_err(HandlerError::InternalServerError)?;
}
let new_key_id = keyset.next_key_id();
let config = crate::key_config::gen_ohttp_server_config_with_id(new_key_id)
.map_err(HandlerError::InternalServerError)?;
if let Some(dir) = &self.ohttp_keys_dir {
let _ = tokio::fs::remove_file(dir.join(format!("{new_key_id}.ikm"))).await;
crate::key_config::persist_key_config(&config, dir)
.map_err(HandlerError::InternalServerError)?;
}

but actually what i would prefer even more is if the old key was cleared sooner, so that the array contains an Option and a None most of the time, and the two keyids are only valid for a grace period accounting for reasonable clock skew and latency (so on the order of a few seconds)

if two keyids are always available for use, then clients using the old key leak information about their local clock, which is a strong fingerprint

that said we definitely should support two keyids during a short time window, because if only one key is ever available then clients who are slightly ahead will expire their key but then fetch the same key and effectively need to poll the ohttp configuration until it changes, which is wasteful, or clients who are behind will make requests with a stale key, get an error, need to do key configuration, and then submit the request under the new key. since this is wasteful and degrades ux, some allowance is necessary.

}
}

async fn maybe_rotate_keys(&self) -> Result<(), HandlerError> {
@nothingmuch (Contributor), Apr 3, 2026:

since i retracted my suggestion to do deterministic keys, i think it also no longer makes sense to do this on the request path, and it should go back to being done in a background loop

the deterministic approach would not have required an RwLock or any IO... the combination of both of these things including IO that happens while holding the RwLock will now potentially block long running requests, and long polling requests which will hold a read() lock for their entire duration.

rwlock does not define reader/writer priority, so depending on the OS, many long polls may block key rotation until all long-poll requests expire (though this can be fixed, see other comment), and any new requests will likewise block while attempting to obtain the rwlock because the current key expired

Collaborator (author):

Dropping the read guard just after decapsulation fixes this in a way, although I'm dropping the function entirely in favour of the background-task approach.

let (bhttp_req, res_ctx) = self
.ohttp
// Decapsulate OHTTP request using the key matching the message's key_id
let keyset = self.ohttp.read().await;
Contributor:

this holds the rwlock for read for the entirety of request handling, but it could be dropped right after decapsulation, since res_ctx contains all the information required to respond
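The scoping suggestion can be sketched with std's synchronous RwLock in place of the async tokio lock (the decapsulation step is faked with a stand-in; names are illustrative):

```rust
use std::sync::RwLock;

// Hypothetical sketch: scope the read guard so it is dropped right after
// decapsulation, instead of living for the whole (possibly long-polling)
// request. `keys` stands in for the shared key set.
fn handle(keys: &RwLock<Vec<u8>>) -> usize {
    let res_ctx = {
        let guard = keys.read().unwrap();
        guard.len() // stand-in for decapsulation producing res_ctx
    }; // read guard dropped here
    // long-running work continues without holding the lock, so a writer
    // (e.g. key rotation) can proceed concurrently:
    assert!(keys.try_write().is_ok());
    res_ctx
}

fn main() {
    let keys = RwLock::new(vec![1u8, 2, 3]);
    assert_eq!(handle(&keys), 3);
}
```

With tokio's RwLock the shape is identical: bind the decapsulation result inside a block so the guard's lifetime ends before the await points that follow.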

@nothingmuch (Contributor) commented:

a simpler way to support two keyids only during the transition window is to always have two of them, no Option; instead keep a valid_from time for each separately.

after valid_from + ttl + grace period, a background loop can overwrite the inactive key with a new one.

when decapsulating, if the key is stale but hasn't been overwritten yet, it's still in the grace period, so the request should be accepted. if it has been overwritten, the keyid now represents the key of the next epoch, but that key hasn't been served via the config yet, so it's impossible for any client to use it.

then the config can start serving the next key ~1/2 of the grace period before the current key expires (valid_from + ttl). it should already be usable because it was overwritten shortly after the end of the last epoch, and serving it slightly before expiry means that clients whose clock is slightly ahead, and have therefore expired their local copy, won't refetch the exact same key with a very short remaining time to live.

finally, if the RwLock is inside the array, then the two locks should generally not be in contention either (assuming they aren't on the same cache line, which may need an alignment pragma), since the rwlock on the stale key should only be taken after requests involving that keyid die out. this means the loop can freely hold the rwlock while it does IO without introducing any jitter to requests that wouldn't have generated an error anyway
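The two-slot scheme above could be sketched roughly like this (the struct layout, names, and durations are all hypothetical; the actual ohttp::Server is elided):

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch: each slot keeps its own valid_from; a key is
// accepted from valid_from until valid_from + ttl + grace.
struct Slot {
    valid_from: Instant,
    // server: ohttp::Server, // elided in this sketch
}

struct Rotator {
    slots: [Slot; 2],
    ttl: Duration,
    grace: Duration,
}

impl Rotator {
    // A key id (0 or 1) decapsulates while within its ttl + grace window.
    fn accepts(&self, key_id: usize, now: Instant) -> bool {
        let slot = &self.slots[key_id];
        now >= slot.valid_from && now < slot.valid_from + self.ttl + self.grace
    }
    // The config endpoint advertises the most recently activated slot.
    fn advertised(&self) -> usize {
        if self.slots[1].valid_from > self.slots[0].valid_from { 1 } else { 0 }
    }
}

fn main() {
    let (ttl, grace) = (Duration::from_secs(3600), Duration::from_secs(10));
    let t0 = Instant::now();
    let r = Rotator {
        slots: [Slot { valid_from: t0 }, Slot { valid_from: t0 + ttl }],
        ttl,
        grace,
    };
    // shortly after rotation, both keys still decapsulate (overlap window)
    let just_after = t0 + ttl + Duration::from_secs(5);
    assert!(r.accepts(0, just_after) && r.accepts(1, just_after));
    // once the grace period lapses, only the new key is accepted
    let later = t0 + ttl + grace + Duration::from_secs(1);
    assert!(!r.accepts(0, later) && r.accepts(1, later));
    assert_eq!(r.advertised(), 1);
}
```

A background loop would then overwrite the stale slot (and persist its key material) once its window closes, without contending with request handling on the active slot's lock.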

let mut iter = configs.into_iter();
let (current_cfg, current_mtime) = iter.next().expect("non-empty");
let current_age =
std::time::SystemTime::now().duration_since(current_mtime).unwrap_or_default();
Contributor:

should expect(), not default; a negative duration indicates the system clock was reset
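A minimal sketch of the suggested behavior, assuming the mtime-based age computation shown above (the function name and panic message are illustrative):

```rust
use std::time::{Duration, SystemTime};

// Hypothetical sketch: crash on a key mtime from the future instead of
// silently defaulting the age to zero, per the review comment.
fn key_age(mtime: SystemTime) -> Duration {
    SystemTime::now()
        .duration_since(mtime)
        .expect("key file mtime is in the future; system clock is misconfigured")
}

fn main() {
    // a key written 42 seconds ago has an age of at least 42 seconds
    let mtime = SystemTime::now() - Duration::from_secs(42);
    let age = key_age(mtime);
    assert!(age >= Duration::from_secs(42) && age < Duration::from_secs(60));
}
```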

@zealsham zealsham force-pushed the key-rotation branch 2 times, most recently from fbac638 to add6ced Compare April 3, 2026 19:07
This PR addresses payjoin#445. It implements OHTTP key rotation for payjoin-mailroom.
Mailroom operators can now configure the interval at which keys are rotated.
If a key has expired, a 422 error is returned to clients. Clients can
handle key rotation via the Cache-Control header returned by the directory.

impl KeyRotatingServer {
pub fn from_single(server: ohttp::Server, key_id: u8) -> Self {
assert!(key_id == 1 || key_id == 2, "key_id must be 1 or 2");
Contributor:

hmm, key identifiers are an arbitrary u8: https://www.ietf.org/rfc/rfc9458.html#section-3.1

so it's probably easier to use key ids 0 and 1 instead of 1 and 2? i think i said 1 and 2 somewhere, so sorry if i made it seem like that was an important detail

Collaborator:

This needs to be addressed. OHTTP targets can set whatever key id they want.

pub fn from_single(server: ohttp::Server, key_id: u8) -> Self {
assert!(key_id == 1 || key_id == 2, "key_id must be 1 or 2");
let mut keys = [None, None];
keys[(key_id - 1) as usize] = Some(server);
@nothingmuch (Contributor), Apr 3, 2026:

it'd be simpler to just initialize both and avoid the Option IMO, no need to pattern match or expect everywhere

btw that's also why i suggested changing "created_at" to "valid_from", since both are created at the same time but only one is valid from Instant::now()

assert!(id == 1 || id == 2, "key_id must be 1 or 2");
keys[(id - 1) as usize] = Some(server);
}
let created_at = Instant::now().checked_sub(current_key_age).unwrap_or_else(Instant::now);
Contributor:

this should .expect(), not .unwrap_or_else(). if there's a key claimed to be from the future, it's best to crash the server because the host machine is clearly misconfigured

Comment on lines +113 to +114
self.current_key_id = new_key_id;
self.current_key_created_at = Instant::now();
Contributor:

Suggested change
self.current_key_id = new_key_id;
self.current_key_created_at = Instant::now();

best if this rotates the next key but doesn't yet activate it; see my previous comments discussing why we want two keys and a grace period. if clients whose clock is running a bit fast expire their local key, they will request the same key repeatedly, so it's better to already offer the next key (and decapsulate with it) a few seconds before the current key expires, and to still accept an expired key a few seconds after expiry. after this grace period, generate the new key to invalidate the old one, removing the need for retire as well.

}
};
let _ = tokio::fs::remove_file(keys_dir.join(format!("{new_key_id}.ikm"))).await;
if let Err(e) = crate::key_config::persist_key_config(&config, &keys_dir) {
Contributor:

IO, especially blocking should not be done while holding a write lock on the whole keyset, since that prevents requests from being handled

see also my suggestion to use an Arc<[Box<RwLock<Server>>; 2]> inside the keyset, as that eliminates this concern; i think it's perfectly fine to write the key while holding a lock there, but persist_key_config should be async

@arminsabouri (Collaborator) left a comment:

There are a couple of unaddressed comments from the previous review. In general, the hardcoded key ids and array indexing seem like a footgun. See my comment about creating two fields in the key config struct.

Edit: A fixed-size array with indexing is fine. However, I would still like to understand why the fields are optional. The primary key should never be None FWIU, and the secondary key should always be present as well -- only used during the grace period. @zealsham

// clients; both slots are accepted for decapsulation so that clients
// with a cached previous key still work during the overlap window.
#[derive(Debug)]
pub struct KeyRotatingServer {
Collaborator:

Naming nit. I would suggest a more descriptive name e.g RotatingKeyConfig.



// with a cached previous key still work during the overlap window.
#[derive(Debug)]
pub struct KeyRotatingServer {
keys: [Option<ohttp::Server>; 2],
@arminsabouri (Collaborator), Apr 3, 2026:

Would we ever have more than 2 active keys? If not, I would suggest having two fields in this struct instead of indexing

{
current_key: ohttp::Server,
previous_key: Option<ohttp::Server>,
current_key_created_at: Instant,
}

This at least ensures that the primary key is always Some. And may obviate the need for key ids to be in {1,2}

Collaborator (author):

Yes, we would have two keys for a short while: the grace period of the previous key before it is retired.

Collaborator:

I understand. However, it seems odd to me that they are optional. Shouldn't the current key always be Some?

None
};
Ok(crate::directory::Service::new(db, ohttp_config.into(), sentinel_tag, v1))
let svc =
Collaborator:

Nit: unnecessary abbreviation

Suggested change
let svc =
let service =
