@alco alco commented Oct 13, 2025

This PR introduces periodic monitoring of retained WAL size when Electric has scaled down its database connections due to inactivity. If during one of these periodic checks the retained WAL is detected to have grown beyond the configured threshold, Electric wakes up the connection subsystem to resume replication stream processing.

Core changes

Connection.Restarter

  • When connections scale down, starts a periodic timer to check retained WAL size
  • Queries Postgres for the retained WAL size using pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
  • If size exceeds threshold, restarts the connection subsystem
  • Logs checks with human-readable sizes and intervals
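The size check above comes down to LSN arithmetic: pg_wal_lsn_diff returns the distance in bytes between two LSNs, and Postgres prints an LSN as two hexadecimal numbers separated by a slash (the high and low 32 bits of a 64-bit position). The sketch below reproduces that arithmetic in Python purely for illustration. It is not the Elixir code from this PR, the function names are hypothetical, and it assumes "exceeds" means strictly greater than the threshold.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' into a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Same result as pg_wal_lsn_diff(current_lsn, restart_lsn) for valid LSNs."""
    return parse_lsn(current_lsn) - parse_lsn(restart_lsn)


def should_wake_up(current_lsn: str, restart_lsn: str, threshold_bytes: int) -> bool:
    """Hypothetical decision helper: wake up only when the retained WAL
    is strictly larger than the threshold (an assumption, see above)."""
    return retained_wal_bytes(current_lsn, restart_lsn) > threshold_bytes


if __name__ == "__main__":
    threshold = 100 * 1024 * 1024  # assumes the 100 MB default is binary megabytes
    print(should_wake_up("0/6400001", "0/0", threshold))
```

In the real system the subtraction happens inside Postgres, so only a single number needs to be compared against the threshold.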

Connection.Manager
When Connection.Manager validates connection options, it stores the validated ones in StackConfig. Connection.Restarter can then use them for one-off DB queries that check the retained WAL size. When Connection.Manager restarts (after an error or when the connection subsystem is woken up), it erases these options from StackConfig and repeats the validation process as before. This prevents the system from ending up in an invalid state: on every startup, Connection.Manager begins with the same initial connection options.
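The lifecycle described above can be modeled as follows. This is a minimal sketch in Python, not the Elixir implementation: the class names only mirror the modules mentioned here, and the storage key and the `validate` callback are hypothetical. It shows the invariant that matters: stored options are erased before every (re)validation, and validation always starts from the same initial options.

```python
KEY = "validated_connection_opts"  # hypothetical key name


class StackConfig:
    """Toy stand-in for a shared per-stack key/value store."""

    def __init__(self):
        self._values = {}

    def put(self, key, value):
        self._values[key] = value

    def get(self, key, default=None):
        return self._values.get(key, default)

    def erase(self, key):
        self._values.pop(key, None)


class ConnectionManager:
    """Toy model of the start/restart behavior described in the text."""

    def __init__(self, stack_config, initial_opts, validate):
        self.stack_config = stack_config
        self.initial_opts = dict(initial_opts)
        self.validate = validate

    def start(self):
        # Drop whatever a previous run stored, so nothing stale is reused.
        self.stack_config.erase(KEY)
        # Always validate a fresh copy of the original options.
        validated = self.validate(dict(self.initial_opts))
        # Publish the result for one-off queries made by other components.
        self.stack_config.put(KEY, validated)
        return validated
```

A restarter-like component would read the stored options with `stack_config.get(KEY)` and skip its check when nothing is stored yet.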

Added two new configuration options:

  • ELECTRIC_REPLICATION_IDLE_WAL_SIZE_CHECK_PERIOD: How often to check retained WAL size when scaled down (default: 1 hour)
  • ELECTRIC_REPLICATION_IDLE_WAL_SIZE_THRESHOLD: WAL size threshold that triggers reconnection (default: 100 MB)

Both options support human-readable time/size formats.
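To make "human-readable formats" concrete, here is a small Python sketch of parsing such values. The exact syntax Electric accepts is not shown in this PR, so everything here is an assumption: the unit names, the choice of binary multiples for sizes, and the helper names `parse_size` and `parse_duration_ms` are all hypothetical.

```python
import re

# Assumed unit tables; the units Electric really accepts may differ.
SIZE_UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}
DURATION_UNITS_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000}


def _parse(value: str, units: dict) -> int:
    """Parse '<integer><unit>' (optional whitespace) using the given unit table."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+)\s*", value)
    if match is None:
        raise ValueError(f"cannot parse {value!r}")
    number, unit = match.groups()
    if unit not in units:
        raise ValueError(f"unknown unit {unit!r} in {value!r}")
    return int(number) * units[unit]


def parse_size(value: str) -> int:
    """Return a size in bytes, e.g. '100MB' -> 104857600 under the assumed units."""
    return _parse(value.upper(), SIZE_UNITS)


def parse_duration_ms(value: str) -> int:
    """Return a duration in milliseconds, e.g. '1h' -> 3600000."""
    return _parse(value.lower(), DURATION_UNITS_MS)
```

Under these assumptions, the documented defaults would correspond to `parse_duration_ms("1h")` and `parse_size("100MB")`.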

Refactoring

Introduced a new module named OneOffConnection that wraps Postgrex.SimpleConnection and provides a simple API for opening a one-off DB connection, running a query (using the simple protocol), and getting the result back, all synchronously.

Reimplemented the lock breaking logic using OneOffConnection and removed LockBreakerConnection since it was no longer necessary to have as a separate module.

Refactored ConnectionResolver, replacing its ad-hoc wrapping of Postgrex.SimpleConnection with OneOffConnection.

Testing

Added a new integration test (integration-tests/tests/wal-size-check-while-scaled-down.lux) that verifies Electric's handling of two cases during its periodic WAL size check: 1) the WAL size is under the threshold; 2) the WAL size has exceeded the threshold.


Closes #3260.

codecov bot commented Oct 13, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.76%. Comparing base (de615ef) to head (9d32a33).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3274      +/-   ##
==========================================
- Coverage   87.79%   87.76%   -0.03%     
==========================================
  Files          18       18              
  Lines        1663     1676      +13     
  Branches      420      425       +5     
==========================================
+ Hits         1460     1471      +11     
- Misses        201      203       +2     
  Partials        2        2              
| Flag | Coverage Δ |
| --- | --- |
| packages/experimental | 87.73% <ø> (ø) |
| packages/react-hooks | 86.48% <ø> (ø) |
| packages/typescript-client | 93.66% <ø> (-0.10%) ⬇️ |
| packages/y-electric | 56.05% <ø> (ø) |
| typescript | 87.76% <ø> (-0.03%) ⬇️ |
| unit-tests | 87.76% <ø> (-0.03%) ⬇️ |

@alco alco force-pushed the alco/wal-size-check branch from 35d985e to ca3a175 Compare October 13, 2025 23:01
@alco alco changed the title Add a configuration option to set the WAL check period in scaled down mode Add two new configuration options for periodic retained WAL size check in scaled down mode Oct 13, 2025
@alco alco changed the base branch from main to alco/scaled-down-stack-event October 13, 2025 23:02
@alco alco force-pushed the alco/wal-size-check branch 2 times, most recently from 4730f9c to fd65b33 Compare October 13, 2025 23:19
@alco alco marked this pull request as ready for review October 13, 2025 23:26
@msfstef msfstef left a comment

Looks good but I assume this does not work since there's no querying of the wal size (and no tests which I assume is for the above reason?)

@alco alco force-pushed the alco/scaled-down-stack-event branch from 92a696c to 38a1a84 Compare October 22, 2025 12:39
Base automatically changed from alco/scaled-down-stack-event to main October 22, 2025 12:54
@alco alco force-pushed the alco/wal-size-check branch from fd65b33 to bd2bd19 Compare November 3, 2025 10:35
To avoid the failure state where Electric can no longer connect to the
db without first doing the revalidation