Replace wal_sender_timeout-based liveness with TCP keepalive. by ibrarahmad · Pull Request #373 · pgEdge/spock

ibrarahmad · 2026-03-04T06:54:44Z

The apply worker previously relied on wal_sender_timeout as both a server-side disconnect trigger and an indirect keepalive pressure on the subscriber. This caused spurious disconnects in two scenarios: a flood of 'w' messages keeping the subscriber too busy to send 'r' feedback in time, and large transactions whose apply time exceeded wal_sender_timeout.

The workaround was maybe_send_feedback(), which force-sent 'r' after every 10 'w' messages or wal_sender_timeout/2, whichever came first. This was a fragile band-aid that coupled subscriber behavior to a server GUC it cannot control.

Replace the entire mechanism with a clean two-layer model:

TCP keepalive (keepalives_idle=10, keepalives_interval=5, keepalives_count=3) is the primary liveness detector on both sides. A dead network or crashed host is detected in ~25 seconds.
wal_sender_timeout=0 is set on replication connections so the walsender never disconnects due to missing 'r' feedback. Liveness on the server side is now handled entirely by TCP keepalive.
spock.apply_idle_timeout (default 300s) is a subscriber-side safety net for a hung-but-connected walsender whose TCP keepalive probes are answered by the kernel but sends no data. Set to 0 to disable.

Fix a bug in last_receive_timestamp handling: it was updated unconditionally after every PQgetCopyData call, including when r==0 (no data available). Each 1-second WL_TIMEOUT spin silently reset the timer, making apply_idle_timeout never fire. Move the update to after the r==0 guard so it reflects actual data receipt only.

Remove maybe_send_feedback() as it is no longer needed.

SPOC-419

coderabbitai · 2026-03-04T06:55:00Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 97c781cc-c795-448f-bde8-acdc5a68ac81

📥 Commits

Reviewing files that changed from the base of the PR and between e4c970d and ed05b18.

📒 Files selected for processing (4)

docs/configuring.md
include/spock.h
src/spock.c
src/spock_apply.c

💤 Files with no reviewable changes (1)

docs/configuring.md

🚧 Files skipped from review as they are similar to previous changes (1)

include/spock.h

📝 Walkthrough

Walkthrough

Adds a new apply-worker idle-timeout GUC, exposes it in the header, adjusts replication connection parameters and TCP keepalive defaults, disables server-side wal_sender_timeout for replication connections, and refactors apply-worker timeout and feedback handling (removes internal feedback helper and updates idle-timer behavior).

Changes

Cohort / File(s)	Summary
Header Export `include/spock.h`	Removed `extern int spock_feedback_frequency;` and added `extern int spock_apply_idle_timeout;`.
GUC Configuration & Connection Setup `src/spock.c`	Added `int spock_apply_idle_timeout = 300` and registered `spock.apply_idle_timeout` GUC (seconds). Increased `CONN_PARAM_ARRAY_SIZE` from 9→10, adjusted TCP keepalive defaults (`keepalives_idle` 20→10, `keepalives_interval` 20→5, `keepalives_count` 5→3), and appended `-c wal_sender_timeout=0` to replication connection `options`. Removed registration of `spock.feedback_frequency`.
Apply Worker Timeout & Feedback `src/spock_apply.c`	Removed static `maybe_send_feedback(...)` and its scheduling/trigger logic. apply_work() now uses `spock_apply_idle_timeout` (seconds→ms) for WL_TIMEOUT checks only when >0, updates `last_receive_timestamp` only after actual COPY data (`r > 0`), and no longer forces feedback from the `'w'` message path.
Documentation `docs/configuring.md`	Removed the `spock.feedback_frequency` configuration option documentation and its description.

Poem

🐇 I watched the WALs as twilight crept,
A tiny timeout softly kept,
No more forced pings to break the hush,
I count the seconds, quiet as a brush,
Reconnect, resume, then nuzzle and leap!

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: replacing wal_sender_timeout-based liveness detection with TCP keepalive as the primary mechanism.
Description check	✅ Passed	The description clearly relates to the changeset, providing detailed context about why the change was needed and explaining the new two-layer liveness detection model.
Docstring Coverage	✅ Passed	Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch FEED_BACK

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

src/spock.c (1)
345-356: Consider making TCP keepalive parameters configurable.

The hardcoded values (idle=10s, interval=5s, count=3) result in ~25s dead connection detection. While reasonable for most deployments, high-latency or unreliable network environments may experience false-positive disconnects.

Consider exposing these as GUCs (e.g., spock.keepalives_idle, spock.keepalives_interval, spock.keepalives_count) to allow tuning without code changes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/spock.c` around lines 345 - 356, Replace the hardcoded TCP keepalive
literals in the keys/vals block with configurable GUC-backed values: define GUCs
(e.g., spock.keepalives_idle, spock.keepalives_interval, spock.keepalives_count)
as int variables (suggest names spock_keepalives_idle,
spock_keepalives_interval, spock_keepalives_count) during module initialization
(e.g., in _PG_init or the existing GUC registration area), register them with
DefineCustomIntVariable, and then use those variables' stringified values when
populating vals[] for the keys "keepalives_idle", "keepalives_interval", and
"keepalives_count" in the code that sets keys[i]/vals[i]; keep the default
values 10/5/3 and ensure bounds checking on the GUCs (positive integers) when
registering.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/spock.c`:
- Around line 345-356: Replace the hardcoded TCP keepalive literals in the
keys/vals block with configurable GUC-backed values: define GUCs (e.g.,
spock.keepalives_idle, spock.keepalives_interval, spock.keepalives_count) as int
variables (suggest names spock_keepalives_idle, spock_keepalives_interval,
spock_keepalives_count) during module initialization (e.g., in _PG_init or the
existing GUC registration area), register them with DefineCustomIntVariable, and
then use those variables' stringified values when populating vals[] for the keys
"keepalives_idle", "keepalives_interval", and "keepalives_count" in the code
that sets keys[i]/vals[i]; keep the default values 10/5/3 and ensure bounds
checking on the GUCs (positive integers) when registering.

ℹ️ Review info

Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 69aec04d-1d81-4de7-913f-14b9f06f8761

📥 Commits

Reviewing files that changed from the base of the PR and between 50764f4 and 1b0262a.

📒 Files selected for processing (3)

include/spock.h
src/spock.c
src/spock_apply.c

danolivo · 2026-03-04T09:53:21Z

src/spock_apply.c

+			 * kernel ACKs them, but no data is being sent.
 			 */
-			if (rc & WL_TIMEOUT)
+			if (rc & WL_TIMEOUT && spock_apply_idle_timeout > 0)


It seems like if walsender just doesn't have data to send for a long time, subscriber will restart. Am I wrong?

It would be better to modify walsender little: skip keepalive messages being busy and rely on TCP status. But send keepalive messages if no data arrives from the WAL. In this case we don't need any subscriber-side GUC at all.

mason-sharp · 2026-03-24T20:52:26Z

@ibrarahmad Needs rebase

mason-sharp

Added a comment about having a non-zero wal_sender_timeout.

Also, needs a rebase.

Also, could use that test file.

mason-sharp · 2026-03-29T04:08:18Z

src/spock.c

+	if (replication)
+	{
+		keys[i] = "options";
+		vals[i] = "-c wal_sender_timeout=0";


Maybe this should be non-zero, like 300000. Less?

What if there is a proxy that blocks the sender when it is trying to send WAL? The keepalives sent from the idle receiver won't make a difference. Meanwhile, WAL is accumulating.

coderabbitai

♻️ Duplicate comments (1)

src/spock_apply.c (1)
2852-2863: ⚠️ Potential issue | 🟠 Major

This can reconnect an idle-but-healthy upstream.

last_receive_timestamp only moves when CopyData arrives. Because the replication connection now runs with wal_sender_timeout = 0, PostgreSQL's walsender returns early from WalSndKeepaliveIfNecessary() and won't send protocol keepalives, so a publisher that is simply caught up and idle can still hit this reconnect path after spock.apply_idle_timeout. As written, this safety net can't distinguish “hung” from “idle”; it needs a sender heartbeat that survives wal_sender_timeout = 0, or this GUC should default to 0. (doxygen.postgresql.org)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/spock_apply.c` around lines 2852 - 2863, The idle-timeout check using
last_receive_timestamp can disconnect a healthy but caught-up publisher because
CopyData isn't updated by walsender keepalives when wal_sender_timeout = 0;
update the logic so spock.apply_idle_timeout does not trigger in that case:
either change the GUC spock_apply_idle_timeout default to 0, or add a guard in
the timeout branch (the block that sets MySpockWorker->worker_status =
SPOCK_WORKER_STATUS_STOPPED and elog(ERROR, ...)) to skip reconnect when the
upstream is only sending WAL keepalives (i.e., detect/update a separate
last_keepalive_timestamp on WalSndKeepaliveIfNecessary/keepalive handling and
use that for idle detection), or check wal_sender_timeout and treat
spock_apply_idle_timeout as disabled when wal_sender_timeout == 0; update
references to last_receive_timestamp, CopyData handling, and the reconnect path
accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/spock_apply.c`:
- Around line 2852-2863: The idle-timeout check using last_receive_timestamp can
disconnect a healthy but caught-up publisher because CopyData isn't updated by
walsender keepalives when wal_sender_timeout = 0; update the logic so
spock.apply_idle_timeout does not trigger in that case: either change the GUC
spock_apply_idle_timeout default to 0, or add a guard in the timeout branch (the
block that sets MySpockWorker->worker_status = SPOCK_WORKER_STATUS_STOPPED and
elog(ERROR, ...)) to skip reconnect when the upstream is only sending WAL
keepalives (i.e., detect/update a separate last_keepalive_timestamp on
WalSndKeepaliveIfNecessary/keepalive handling and use that for idle detection),
or check wal_sender_timeout and treat spock_apply_idle_timeout as disabled when
wal_sender_timeout == 0; update references to last_receive_timestamp, CopyData
handling, and the reconnect path accordingly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5f499ec7-1af5-4d57-bcd9-56189b254fe8

📥 Commits

Reviewing files that changed from the base of the PR and between 1b0262a and e4c970d.

📒 Files selected for processing (3)

include/spock.h
src/spock.c
src/spock_apply.c

✅ Files skipped from review due to trivial changes (1)

include/spock.h

🚧 Files skipped from review as they are similar to previous changes (1)

src/spock.c

The apply worker previously relied on wal_sender_timeout as both a server-side disconnect trigger and an indirect keepalive pressure on the subscriber. This caused spurious disconnects in two scenarios: a flood of 'w' messages keeping the subscriber too busy to send 'r' feedback in time, and large transactions whose apply time exceeded wal_sender_timeout. The workaround was maybe_send_feedback(), which force-sent 'r' after every 10 'w' messages or wal_sender_timeout/2, whichever came first. This was a fragile band-aid that coupled subscriber behavior to a server GUC it cannot control. Replace the entire mechanism with a clean two-layer model: - TCP keepalive (keepalives_idle=10, keepalives_interval=5, keepalives_count=3) is the primary liveness detector on both sides. A dead network or crashed host is detected in ~25 seconds. - wal_sender_timeout=0 is set on replication connections so the walsender never disconnects due to missing 'r' feedback. Liveness on the server side is now handled entirely by TCP keepalive. - spock.apply_idle_timeout (default 300s) is a subscriber-side safety net for a hung-but-connected walsender whose TCP keepalive probes are answered by the kernel but sends no data. Set to 0 to disable. Fix a bug in last_receive_timestamp handling: it was updated unconditionally after every PQgetCopyData call, including when r==0 (no data available). Each 1-second WL_TIMEOUT spin silently reset the timer, making apply_idle_timeout never fire. Move the update to after the r==0 guard so it reflects actual data receipt only. Remove maybe_send_feedback() as it is no longer needed.

mason-sharp · 2026-03-31T18:59:31Z

Did a rebase + one additional commit

codacy-production · 2026-03-31T18:59:55Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics -1 duplication

Metric Results

Duplication -1

View in Codacy

_{TIP This summary will be updated as you push new changes. Give us feedback}

ibrarahmad requested a review from mason-sharp March 4, 2026 06:55

coderabbitai bot reviewed Mar 4, 2026

View reviewed changes

danolivo reviewed Mar 4, 2026

View reviewed changes

mason-sharp reviewed Mar 29, 2026

View reviewed changes

coderabbitai bot reviewed Mar 30, 2026

View reviewed changes

Ibrar Ahmed and others added 2 commits March 31, 2026 11:55

Remove feedback_frequency GUC, not used

ed05b18

mason-sharp force-pushed the FEED_BACK branch from e4c970d to ed05b18 Compare March 31, 2026 18:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace wal_sender_timeout-based liveness with TCP keepalive.#373

Replace wal_sender_timeout-based liveness with TCP keepalive.#373
ibrarahmad wants to merge 2 commits intomainfrom
FEED_BACK

ibrarahmad commented Mar 4, 2026 •

edited by mason-sharp

Loading

Uh oh!

coderabbitai bot commented Mar 4, 2026 •

edited

Loading

Walkthrough

Changes

Poem

Uh oh!

coderabbitai bot left a comment

Uh oh!

danolivo Mar 4, 2026

Uh oh!

mason-sharp commented Mar 24, 2026

Uh oh!

mason-sharp left a comment

Uh oh!

mason-sharp Mar 29, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

mason-sharp commented Mar 31, 2026

Uh oh!

codacy-production bot commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ibrarahmad commented Mar 4, 2026 • edited by mason-sharp Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai bot commented Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

danolivo Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

mason-sharp commented Mar 24, 2026

Uh oh!

mason-sharp left a comment

Choose a reason for hiding this comment

Uh oh!

mason-sharp Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

mason-sharp commented Mar 31, 2026

Uh oh!

codacy-production bot commented Mar 31, 2026

Up to standards ✅

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ibrarahmad commented Mar 4, 2026 •

edited by mason-sharp

Loading

coderabbitai bot commented Mar 4, 2026 •

edited

Loading