Skip to content

fix(core): defer MLS epoch recovery until notification replay finishes [WPB-21916]#21390

Open
thisisamir98 wants to merge 7 commits into
devfrom
WPB-21916-epoch-fix
Open

fix(core): defer MLS epoch recovery until notification replay finishes [WPB-21916]#21390
thisisamir98 wants to merge 7 commits into
devfrom
WPB-21916-epoch-fix

Conversation

@thisisamir98
Copy link
Copy Markdown
Collaborator

@thisisamir98 thisisamir98 commented May 27, 2026

BugWPB-21916 [Web/Desktop] : Missing messages in Web - Prod

Problem
After login, the client replays a large backlog of legacy notifications while local MLS state may already be ahead, behind, or incomplete. When decrypting an MLS message-add failed with WrongEpoch, the recovery orchestrator ran an external-commit rejoin immediately. That could happen in the middle of replay, before later notifications had a chance to apply missing commits.

This led to unnecessary MLSConversationRecovered system messages and made replay noisier. Pausing the rejoin queue did not help, because orchestrator recovery bypassed that queue entirely.

Approach

  • Defer epoch-mismatch recovery for inbound handleMessageAdd while the connection is not LIVE. The failed message returns null and replay continues.
  • After catch-up finishes, runDeferredEpochRecovery() reconciles any conversations that still need external commit (once per conversation).

Problem
After login, the client replays a large backlog of legacy notifications while
local MLS state may already be ahead, behind, or incomplete. When decrypting
an MLS message-add failed with WrongEpoch, the recovery orchestrator ran an
external-commit rejoin immediately. That could happen in the middle of replay,
before later notifications had a chance to apply missing commits.

This led to unnecessary MLSConversationRecovered system messages and made
replay noisier. Pausing the rejoin queue did not help, because orchestrator
recovery bypassed that queue entirely.

Approach
- Defer epoch-mismatch recovery for inbound handleMessageAdd while the
  connection is not LIVE. The failed message returns null and replay continues.
- After catch-up finishes, runDeferredEpochRecovery() reconciles any
  conversations that still need external commit (once per conversation).
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 27, 2026

🔗 Download Full Report Artifact

🧪 Playwright Test Summary

  • Passed: 15
  • Failed: 0
  • Skipped: 0
  • 🔁 Flaky: 0
  • 📊 Total: 15
  • Total Runtime: 105.7s (~ 1 min 46 sec)

PR checkout only included the head commit, so nx format:check could not
resolve the dev base ref and logged a misleading git error before falling
back to checking the whole repo. Fetch the base branch in CI so format
failures are reported clearly.
@thisisamir98 thisisamir98 force-pushed the WPB-21916-epoch-fix branch from f952354 to 5a0c6d2 Compare May 27, 2026 10:17
Comment thread .github/workflows/ci.yml Fixed
Comment thread .github/workflows/ci.yml Fixed
@sonarqubecloud
Copy link
Copy Markdown


for (const {conversationId, subconvId, trigger} of entries) {
try {
await this.recoverMLSGroupFromEpochMismatch(conversationId, subconvId, trigger, {force: true});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If entries are already cleared and after that if for some reason recoverMLSGroupFromEpochMismatch fails then there will be no way to retry the entry.

What do you think?

const firstEventPayload = notification.data.event.payload[0];
const notificationTime = firstEventPayload ? this.getNotificationEventTime(firstEventPayload) : null;
if (this.connectionState !== ConnectionState.LIVE && notificationTime !== null && notificationTime.length > 0) {
if (!this.connectionStateTracker.isLive() && notificationTime !== null && notificationTime.length > 0) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!this.connectionStateTracker.isLive() && notificationTime !== null && notificationTime.length > 0) {
if (!this.connectionStateTracker.isLive() && is.nonEmptyString(notificationTime)) {

.push(async () => {
this.logger.info(`Resuming message sending. ${getQueueLength()} messages to be sent`);
await this.rehydrateMlsPendingProposalsTasksOnLiveTransition();
await this.service!.conversation.runDeferredEpochRecovery();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need non-null assertion here?

trigger?: MlsEpochRecoveryTrigger,
options: {force: boolean} = {force: false},
) {
if (options.force === false && this.shouldDeferEpochRecovery(trigger)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If option.force is boolean (and controlled by us) the please rename to isForced

Suggested change
if (options.force === false && this.shouldDeferEpochRecovery(trigger)) {
if (!options.force && this.shouldDeferEpochRecovery(trigger)) {

}

private getDeferredEpochRecoveryKey(conversationId: QualifiedId, subconvId?: SUBCONVERSATION_ID): string {
return `${conversationId.id}@${conversationId.domain}:${subconvId ?? 'none'}`;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this none can be provided by any enum?

* notification replay. Called when the connection transitions to LIVE.
*/
public async runDeferredEpochRecovery(): Promise<void> {
if (this.deferredEpochRecoveries.size === 0) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to use is.emptySet(deferredEpochRecoveries)

options: {force: boolean} = {force: false},
) {
if (options.force === false && this.shouldDeferEpochRecovery(trigger)) {
this.deferEpochRecovery(conversationId, subconversationId, trigger);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are throwing an error from deferEpochRecovery, are we handling it somewhere?

private readonly subconversationService: SubconversationService,
private readonly isMLSConversationRecoveryEnabled: () => Promise<boolean>,
private readonly _mlsService?: MLSService,
private readonly connectionStateTracker: ConnectionStateTracker = createConnectionStateTracker(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we maintaining two state-managers for connection?
I can see another one inlibraries/core/src/account.ts. These states will not be synced, is this intentional?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants