fix(core): defer MLS epoch recovery until notification replay finishes [WPB-21916]#21390
fix(core): defer MLS epoch recovery until notification replay finishes [WPB-21916]#21390thisisamir98 wants to merge 7 commits into
Conversation
Problem After login, the client replays a large backlog of legacy notifications while local MLS state may already be ahead, behind, or incomplete. When decrypting an MLS message-add failed with WrongEpoch, the recovery orchestrator ran an external-commit rejoin immediately. That could happen in the middle of replay, before later notifications had a chance to apply missing commits. This led to unnecessary MLSConversationRecovered system messages and made replay noisier. Pausing the rejoin queue did not help, because orchestrator recovery bypassed that queue entirely. Approach - Defer epoch-mismatch recovery for inbound handleMessageAdd while the connection is not LIVE. The failed message returns null and replay continues. - After catch-up finishes, runDeferredEpochRecovery() reconciles any conversations that still need external commit (once per conversation).
|
🔗 Download Full Report Artifact 🧪 Playwright Test Summary
|
PR checkout only included the head commit, so nx format:check could not resolve the dev base ref and logged a misleading git error before falling back to checking the whole repo. Fetch the base branch in CI so format failures are reported clearly.
f952354 to
5a0c6d2
Compare
Avoid expanding github.base_ref directly in the shell script so zizmor does not flag code injection via template expansion.
|
|
|
||
| for (const {conversationId, subconvId, trigger} of entries) { | ||
| try { | ||
| await this.recoverMLSGroupFromEpochMismatch(conversationId, subconvId, trigger, {force: true}); |
There was a problem hiding this comment.
If entries are already cleared and after that if for some reason recoverMLSGroupFromEpochMismatch fails then there will be no way to retry the entry.
What do you think?
| const firstEventPayload = notification.data.event.payload[0]; | ||
| const notificationTime = firstEventPayload ? this.getNotificationEventTime(firstEventPayload) : null; | ||
| if (this.connectionState !== ConnectionState.LIVE && notificationTime !== null && notificationTime.length > 0) { | ||
| if (!this.connectionStateTracker.isLive() && notificationTime !== null && notificationTime.length > 0) { |
There was a problem hiding this comment.
| if (!this.connectionStateTracker.isLive() && notificationTime !== null && notificationTime.length > 0) { | |
| if (!this.connectionStateTracker.isLive() && is.nonEmptyString(notificationTime)) { |
| .push(async () => { | ||
| this.logger.info(`Resuming message sending. ${getQueueLength()} messages to be sent`); | ||
| await this.rehydrateMlsPendingProposalsTasksOnLiveTransition(); | ||
| await this.service!.conversation.runDeferredEpochRecovery(); |
There was a problem hiding this comment.
Do we need non-null assertion here?
| trigger?: MlsEpochRecoveryTrigger, | ||
| options: {force: boolean} = {force: false}, | ||
| ) { | ||
| if (options.force === false && this.shouldDeferEpochRecovery(trigger)) { |
There was a problem hiding this comment.
If option.force is boolean (and controlled by us) the please rename to isForced
| if (options.force === false && this.shouldDeferEpochRecovery(trigger)) { | |
| if (!options.force && this.shouldDeferEpochRecovery(trigger)) { |
| } | ||
|
|
||
| private getDeferredEpochRecoveryKey(conversationId: QualifiedId, subconvId?: SUBCONVERSATION_ID): string { | ||
| return `${conversationId.id}@${conversationId.domain}:${subconvId ?? 'none'}`; |
There was a problem hiding this comment.
Is this none can be provided by any enum?
| * notification replay. Called when the connection transitions to LIVE. | ||
| */ | ||
| public async runDeferredEpochRecovery(): Promise<void> { | ||
| if (this.deferredEpochRecoveries.size === 0) { |
There was a problem hiding this comment.
I would suggest to use is.emptySet(deferredEpochRecoveries)
| options: {force: boolean} = {force: false}, | ||
| ) { | ||
| if (options.force === false && this.shouldDeferEpochRecovery(trigger)) { | ||
| this.deferEpochRecovery(conversationId, subconversationId, trigger); |
There was a problem hiding this comment.
You are throwing an error from deferEpochRecovery, are we handling it somewhere?
| private readonly subconversationService: SubconversationService, | ||
| private readonly isMLSConversationRecoveryEnabled: () => Promise<boolean>, | ||
| private readonly _mlsService?: MLSService, | ||
| private readonly connectionStateTracker: ConnectionStateTracker = createConnectionStateTracker( |
There was a problem hiding this comment.
Why are we maintaining two state-managers for connection?
I can see another one inlibraries/core/src/account.ts. These states will not be synced, is this intentional?



Problem
After login, the client replays a large backlog of legacy notifications while local MLS state may already be ahead, behind, or incomplete. When decrypting an MLS message-add failed with WrongEpoch, the recovery orchestrator ran an external-commit rejoin immediately. That could happen in the middle of replay, before later notifications had a chance to apply missing commits.
This led to unnecessary MLSConversationRecovered system messages and made replay noisier. Pausing the rejoin queue did not help, because orchestrator recovery bypassed that queue entirely.
Approach