Skip to content

SPOC-484: TRANSDISCARD/SUB_DISABLE: switch from subtransaction replay to read-only dry-run logging#400

Open
danolivo wants to merge 4 commits intomainfrom
spoc-484
Open

SPOC-484: TRANSDISCARD/SUB_DISABLE: switch from subtransaction replay to read-only dry-run logging#400
danolivo wants to merge 4 commits intomainfrom
spoc-484

Conversation

@danolivo
Copy link
Copy Markdown
Contributor

  • In TRANSDISCARD and SUB_DISABLE modes, replace per-DML subtransaction replay with a single read-only transaction. Each DML handler logs a row to spock.exception_log instead of attempting the operation. Only the record that originally caused the error carries the real error message; other records get an empty string.
  • Identify the failed record by xact_action_counter (saved as failed_action in SpockExceptionLog), which is unique per DML within a transaction.
  • In DISCARD mode, each DML still runs inside a subtransaction; only the failed ones are logged.
  • Add dry_run_logging regression test covering all three exception modes with different breakage methods (absent table, truncated table, deleted row) and an INSERT_EXISTS conflict to verify that dry-run modes leave spock.resolutions empty while DISCARD populates it.

…-run replay

In TRANSDISCARD and SUB_DISABLE modes, the second (retry) pass replays
the transaction as read-only.  Each DML handler logs a row to
spock.exception_log instead of attempting the operation.  The record
that originally caused the error carries the real error message; other
records get an empty string.

The failed record is identified by xact_action_counter (saved as
failed_action in SpockExceptionLog), which is unique per DML within a
transaction.

In DISCARD mode, each DML still runs inside a subtransaction; only the
failed ones are logged.

Add dry_run_logging regression test covering TRANSDISCARD (absent
table), DISCARD (truncated table), and SUB_DISABLE (deleted row), with
INSERT_EXISTS conflict on drl_t1 to verify that dry-run modes leave
spock.resolutions empty while DISCARD populates it.
@danolivo danolivo self-assigned this Mar 24, 2026
@danolivo danolivo added the enhancement New feature or request label Mar 24, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 204bdcbf-599f-4ff0-b14e-397097d39404

📥 Commits

Reviewing files that changed from the base of the PR and between 7f4beea and a069f87.

📒 Files selected for processing (1)
  • src/spock_apply.c
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/spock_apply.c

📝 Walkthrough

Walkthrough

Adds failed_action to exception logs, treats NULL error messages as empty strings, enforces read-only replay for TRANSDISCARD/SUB_DISABLE retry paths while logging skipped operations, simplifies commit/disable control flow, prevents LSN advancement on SUB_DISABLE retries, adds a new regression test, and includes the test in Makefile REGRESS.

Changes

Cohort / File(s) Summary
Exception handler declarations
include/spock_exception_handler.h
Added uint32 failed_action to SpockExceptionLog to record the xact action counter at the time of an error.
Exception logging behavior
src/spock_exception_handler.c
NULL error_message now stored as an empty string ("") in spock.exception_log.error_message; related comment removed.
Apply worker / replay logic
src/.../spock_apply.c
When use_try_block and exception_behaviour is TRANSDISCARD or SUB_DISABLE, replay transactions are made read-only and DML/SQL execution is skipped while logging discarded operations; commit/disable control flow simplified (no parent transaction abort/restart); SUB_DISABLE retry paths block LSN advancement; failed_action used to attach real error messages to the originating action.
New and updated regression tests
tests/regress/sql/dry_run_logging.sql, tests/regress/sql/replication_set.sql
Added dry_run_logging.sql covering transdiscard/discard/sub_disable scenarios, exception_log/resolutions checks, and subscription-disable validation; replication_set.sql now truncates spock.exception_log before runs and adds extra exception-log validation selects.
Build/test config
Makefile
Added dry_run_logging to the REGRESS test list.

Poem

🐇 I hopped through lines where actions failed,
Marked the step where the real fault hailed.
I skipped the write but kept the tale,
Logged the action so replies won't bail.
Thump — the ledger holds each tiny trail.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 55.56% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: switching from subtransaction replay to read-only dry-run logging for TRANSDISCARD/SUB_DISABLE modes, which is clearly reflected across the modified files.
Description check ✅ Passed The description is directly related to the changeset, explaining the approach for TRANSDISCARD/SUB_DISABLE modes (read-only transactions with logging instead of subtransactions), the failed_action tracking mechanism, DISCARD behavior, and the new regression test.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch spoc-484

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot]

This comment was marked as outdated.

…producer

Replace wait_slot_confirm_lsn with sync_event/wait_for_sync_event to
eliminate replication race conditions when querying exception_log on the
subscriber.

Add a TRANSDISCARD sub-test that mixes DML with a replicate_ddl call
inside a single failing transaction.  This reproduces a duplicate DDL
logging bug in handle_queued_ddl detected by CodeRabbit during review:
the TRANSDISCARD/SUB_DISABLE branch logs via add_entry_to_exception_log,
then falls through to the generic logging block which logs again.

Remove test section numbering per project conventions.
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/regress/sql/dry_run_logging.sql`:
- Around line 107-110: Update the test comment that currently reads "Expect
entries with duplicated DDL record" to reflect the intended behavior of no
duplicate DDL logging; modify the comment above the SELECT querying
spock.exception_log (the query using table_name, operation, (error_message <>
'') AS has_error, ddl_statement ordered by command_counter) to state that the
test expects NO duplicate DDL entries (i.e., a single DDL record per command).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fc49989d-5676-4a35-890d-8275efe21d33

📥 Commits

Reviewing files that changed from the base of the PR and between f359f4d and 936e414.

⛔ Files ignored due to path filters (1)
  • tests/regress/expected/dry_run_logging.out is excluded by !**/*.out
📒 Files selected for processing (2)
  • src/spock_apply.c
  • tests/regress/sql/dry_run_logging.sql

…ISABLE

The TRANSDISCARD/SUB_DISABLE branch logged via add_entry_to_exception_log
then fell through to the generic logging block which called
should_log_exception and logged again, producing duplicate DDL entries.

Merge both call sites into a single add_entry_to_exception_log after the
if/else branches.  The TRANSDISCARD/SUB_DISABLE branch now only extracts
the SQL text; actual logging is done in the shared block.

Add Assert(my_exception_log_index >= 0) before accessing exception_log_ptr
since use_try_block guarantees the index was set in handle_begin.

Bug reported by CodeRabbit during review.
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
src/spock_apply.c (3)

1287-1308: Consider adding defensive assertion before accessing exception_log_ptr.

The code at line 2302 includes Assert(my_exception_log_index >= 0) before accessing exception_log_ptr, but the DML handlers (INSERT, UPDATE, DELETE) don't have this check. While my_exception_log_index should be valid when use_try_block is true (set during handle_begin), adding the assertion would provide consistent defensive validation.

🛡️ Suggested defensive assertion
 		if (exception_behaviour == TRANSDISCARD ||
 			exception_behaviour == SUB_DISABLE)
 		{
+			Assert(my_exception_log_index >= 0);
+
 			/*
 			 * TRANSDISCARD and SUB_DISABLE: skip the DML and log the
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/spock_apply.c` around lines 1287 - 1308, Add a defensive Assert that
my_exception_log_index is valid before dereferencing exception_log_ptr in the
DML handlers (the INSERT/UPDATE/DELETE branches where exception_log_ptr[...] is
accessed and log_insert_exception(...) is called). Specifically, in the block
using exception_log_ptr[my_exception_log_index] (and before setting
exception_log_ptr[my_exception_log_index].local_tuple or reading
.initial_error_message), insert Assert(my_exception_log_index >= 0) to mirror
the check used in handle_begin/other code paths so the handlers (where
use_try_block is expected) validate the index before use.

1299-1304: Optional: Extract repeated error_msg pattern to helper function.

The same logic for computing error_msg is repeated in four handlers (INSERT, UPDATE, DELETE, SQL):

char *error_msg =
    (xact_action_counter ==
     exception_log_ptr[my_exception_log_index].failed_action &&
     exception_log_ptr[my_exception_log_index].initial_error_message[0]) ?
    exception_log_ptr[my_exception_log_index].initial_error_message :
    NULL;

This could be extracted to a helper function to improve maintainability and ensure consistent behavior if the logic ever needs to change.

♻️ Suggested helper function
/*
 * Get the error message to log for the current action during dry-run replay.
 * Returns the initial error message only for the action that caused the
 * original failure; returns NULL for all other actions.
 */
static inline char *
get_dry_run_error_msg(void)
{
    Assert(my_exception_log_index >= 0);
    
    if (xact_action_counter == exception_log_ptr[my_exception_log_index].failed_action &&
        exception_log_ptr[my_exception_log_index].initial_error_message[0] != '\0')
        return exception_log_ptr[my_exception_log_index].initial_error_message;
    
    return NULL;
}

Also applies to: 1477-1482, 1594-1599, 2307-2312

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/spock_apply.c` around lines 1299 - 1304, Extract the repeated error_msg
computation into a small static helper (e.g. static inline char
*get_dry_run_error_msg(void)) that reads my_exception_log_index,
xact_action_counter and exception_log_ptr and returns
exception_log_ptr[my_exception_log_index].initial_error_message when
xact_action_counter == exception_log_ptr[my_exception_log_index].failed_action
and the initial_error_message is non-empty, otherwise returns NULL; replace the
duplicated ternary in the four handlers (the blocks around the places that set
error_msg in INSERT/UPDATE/DELETE/SQL handlers) with a call to
get_dry_run_error_msg() and keep the Assert(my_exception_log_index >= 0) in the
helper.

871-878: Consider clearing initial_error_message for consistency.

Lines 840-843 (TRANSDISCARD path) clear initial_error_message[0] = '\0', but lines 871-876 (SUB_DISABLE path) do not. While this is harmless since the worker exits after spock_disable_subscription(), adding the clear would maintain consistency between the two paths.

♻️ Suggested change for consistency
 			exception_log = &exception_log_ptr[my_exception_log_index];
 			exception_log->commit_lsn = InvalidXLogRecPtr;
+			exception_log->initial_error_message[0] = '\0';
 			MySpockWorker->restart_delay = 0;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/spock_apply.c` around lines 871 - 878, In the SUB_DISABLE path add the
same clearing of the saved initial error string as done in the TRANSDISCARD
path: after assigning exception_log = &exception_log_ptr[my_exception_log_index]
and before calling spock_disable_subscription()/elog(ERROR...), set
exception_log->initial_error_message[0] = '\0' so the SpockExceptionLog
(exception_log) is consistent with the TRANSDISCARD handling; keep the rest
(commit_lsn, MySpockWorker->restart_delay, spock_disable_subscription(), elog)
unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/spock_apply.c`:
- Around line 1287-1308: Add a defensive Assert that my_exception_log_index is
valid before dereferencing exception_log_ptr in the DML handlers (the
INSERT/UPDATE/DELETE branches where exception_log_ptr[...] is accessed and
log_insert_exception(...) is called). Specifically, in the block using
exception_log_ptr[my_exception_log_index] (and before setting
exception_log_ptr[my_exception_log_index].local_tuple or reading
.initial_error_message), insert Assert(my_exception_log_index >= 0) to mirror
the check used in handle_begin/other code paths so the handlers (where
use_try_block is expected) validate the index before use.
- Around line 1299-1304: Extract the repeated error_msg computation into a small
static helper (e.g. static inline char *get_dry_run_error_msg(void)) that reads
my_exception_log_index, xact_action_counter and exception_log_ptr and returns
exception_log_ptr[my_exception_log_index].initial_error_message when
xact_action_counter == exception_log_ptr[my_exception_log_index].failed_action
and the initial_error_message is non-empty, otherwise returns NULL; replace the
duplicated ternary in the four handlers (the blocks around the places that set
error_msg in INSERT/UPDATE/DELETE/SQL handlers) with a call to
get_dry_run_error_msg() and keep the Assert(my_exception_log_index >= 0) in the
helper.
- Around line 871-878: In the SUB_DISABLE path add the same clearing of the
saved initial error string as done in the TRANSDISCARD path: after assigning
exception_log = &exception_log_ptr[my_exception_log_index] and before calling
spock_disable_subscription()/elog(ERROR...), set
exception_log->initial_error_message[0] = '\0' so the SpockExceptionLog
(exception_log) is consistent with the TRANSDISCARD handling; keep the rest
(commit_lsn, MySpockWorker->restart_delay, spock_disable_subscription(), elog)
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 031139ab-7009-4644-ab84-545c7b0c6bb2

📥 Commits

Reviewing files that changed from the base of the PR and between 936e414 and 7f4beea.

⛔ Files ignored due to path filters (1)
  • tests/regress/expected/dry_run_logging.out is excluded by !**/*.out
📒 Files selected for processing (2)
  • src/spock_apply.c
  • tests/regress/sql/dry_run_logging.sql
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/regress/sql/dry_run_logging.sql

Also, harden the PG_CATCH block: explicitly switch to ApplyOperationContext
before CopyErrorData so edata lands in a known context, call
FlushErrorState before RollbackAndReleaseCurrentSubTransaction to clear
the error stack, and move ReleaseCurrentSubTransaction into PG_TRY so
errors during release are caught.
*/
if (error_message == NULL)
values[Anum_exception_log_error_message - 1] = CStringGetTextDatum("unknown");
values[Anum_exception_log_error_message - 1] = CStringGetTextDatum("");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I try to avoid using empty strings because I have seen users get confused between empty strings and NULL values when querying, but "unknown" is also not great. Just adding a comment in case you or anyone has a better suggestion. Perhaps "unavailable"?

Copy link
Copy Markdown
Member

@mason-sharp mason-sharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this, it is much better than what we were doing before. I will revisit after the read-only PR.

if (!xact_had_exception ||
if ((!xact_had_exception &&
!(MyApplyWorker->use_try_block &&
exception_behaviour == SUB_DISABLE)) ||
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants