Skip to content

Conversation

@vazois
Copy link
Collaborator

@vazois vazois commented Oct 16, 2025

This PR fixes couple of issues:

  1. Our existing transaction replay mechanism keeps track of the operations of a given transaction in a dictionary indexed by sessionID. When a replica encounters a checkpoint marker, it will start buffering the transaction commit markers so it can replay them after encountering the CheckpointEnd marker.
    In that situation, our current solution might intermingle operations between two different transactions from the same session.
  2. When replaying the individual operations of a transaction at the replica, there is no locking. In that case, the results of the operations become immediately visible which not desirable given that a reader might see partial results.

In this PR, I am adding the following classes:

  • TransactionGroup
    Keeps track of the list of transaction operations
  • AofReplayContext
    Keeps track of the replay context and associated objects/buffers used to replay individual operations and transactions.
  • AofReplayCoordinator
    Implements the mechanism to keep track of active transactions, gather transaction operations and coordinate transaction replay.

@vazois vazois force-pushed the vazois/aofprocessor-txn branch from 233a1f0 to a34033c Compare October 16, 2025 03:50
@vazois vazois marked this pull request as ready for review November 14, 2025 23:49
Copilot AI review requested due to automatic review settings November 14, 2025 23:49
Copilot finished reviewing on behalf of vazois November 14, 2025 23:52
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes transaction replay issues during fuzzy region processing in the AOF (Append-Only File) recovery mechanism. The main problem addressed is that the previous implementation could intermingle operations from different transactions with the same session ID, and lacked proper locking during replay which could expose partial transaction results to readers.

Key changes:

  • Introduced three new classes (TransactionGroup, AofReplayContext, AofReplayCoordinator) to properly track and coordinate transaction replay operations
  • Refactored AOF recovery logic to use the new coordinator pattern with proper transaction isolation
  • Updated test infrastructure to support testing both main store and object store replication scenarios

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 22 comments.

Show a summary per file
File Description
libs/server/AOF/ReplayCoordinator/TransactionGroup.cs New class to track operations for individual transactions
libs/server/AOF/ReplayCoordinator/AofReplayContext.cs New class to maintain replay state including fuzzy region buffers and active transactions
libs/server/AOF/ReplayCoordinator/AofReplayCoordinator.cs New coordinator class implementing transaction tracking, buffering, and replay logic with proper locking
libs/server/AOF/AofProcessor.cs Refactored to use the new coordinator pattern; simplified recovery logic and removed inline transaction handling
test/Garnet.test.cluster/ClusterTestContext.cs Added helper methods SimplePopulateDB and SimpleValidateDB to streamline test database operations
test/Garnet.test.cluster/ReplicationTests/ClusterReplicationBaseTests.cs Updated tests to use new helper methods and support both object store and main store validation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants