Add more WAL metrics #3685
Pull Request Overview
This PR adds extensive WAL (Write-Ahead Log) metrics to improve visibility into PostgreSQL replication slot behavior, particularly during reconnection scenarios. The new metrics help identify bottlenecks across different replication phases and provide insights into logical decoding memory usage and disk spilling.
Key changes:
- Adds 15+ new metrics tracking LSN positions, deltas between replication phases, and WAL sender state
- Introduces monitoring for logical decoding work memory configuration and spill statistics (PG16+)
- Enhances slot monitoring with wait event tracking and safe WAL size indicators
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| protos/route.proto | Adds 15 new fields to SlotInfo message for enhanced WAL metrics |
| nexus/catalog/migrations/V49__enhanced_slot_metrics.sql | Creates database migration to add new metric columns to peer_slot_size table |
| flow/otel_metrics/otel_manager.go | Defines and initializes new OpenTelemetry gauge metrics for WAL monitoring |
| flow/otel_metrics/attributes.go | Adds new attribute keys for wait events, backend state, and WAL status |
| flow/connectors/utils/monitoring/monitoring.go | Updates AppendSlotSizeInfo to persist all new metric fields |
| flow/connectors/postgres/postgres.go | Records new metrics in HandleSlotInfo and removes nil-check guards |
| flow/connectors/postgres/client.go | Rewrites getSlotInfo query with JOINs to collect comprehensive slot and replication statistics |
| flow/activities/flowable.go | Wires new metric gauges into the FlowableActivity's RecordSlotSizes |
Co-authored-by: Copilot <[email protected]>
Q: anything preventing us from querying this every minute instead of every 5 minutes? 5 minutes is kinda close to the unhappy zone for the customers.
❌ 2 tests failed, both flagged as ❄️ flaky.
Adding a bunch of WAL metrics in hopes of better seeing the impact of deployment on a busy instance.
- pg_stat_activity.wait_event_type / wait_event for our backend: these are really the source of truth on when the WAL sender becomes idle (wait_event_type=Client, wait_event=WalSenderWaitForWAL) because there's nothing to send (even if the current LSN is ahead of what was sent to us because of an unfinished transaction). However, they can also change many times a second, and polling once every few minutes is pretty sparse. Putting it in in case it's still useful.
- Commit LSN on the receiver - the latest we've received from Postgres so far
- pg_stat_replication.sent_lsn - we don't always have permission to read this one, but we saw some discrepancies with commit LSN, so keeping it as optional for debugging purposes
- Current LSN - to diff with commit LSN
- Deltas between PG-side LSNs, just for convenience
- pg_stat_replication_slots.spill_txns / spill_count / spill_bytes (PG16+), logical_decoding_work_mem (PG13+) - to see what PG is doing after reconnect and before it has sent anything to us, and to be able to diagnose a bad setting

Misc quality of life metrics:
- pg_replication_slots.safe_wal_size (PG13+) - monitor the danger zone when we're behind or there's a burst
- pg_replication_slots.active / wal_status, pg_stat_activity.state
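
For anyone reading along, here is a minimal Go sketch of how the LSN deltas and the idle check described above could be computed on the client side. It is an illustration, not the code in this PR: `parseLSN`, `lsnDelta`, and `isWalSenderIdle` are hypothetical helper names, and the LSN values in `main` are made up. The wait event name is compared case-insensitively on the assumption that its capitalization may differ between Postgres versions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLSN converts a PostgreSQL LSN in its textual "hi/lo" hex form
// (for example "16/B374D848") into a 64-bit byte position.
// Hypothetical helper, not part of this PR.
func parseLSN(s string) (uint64, error) {
	hi, lo, ok := strings.Cut(s, "/")
	if !ok {
		return 0, fmt.Errorf("invalid LSN %q: missing '/'", s)
	}
	h, err := strconv.ParseUint(hi, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", s, err)
	}
	l, err := strconv.ParseUint(lo, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", s, err)
	}
	return h<<32 | l, nil
}

// lsnDelta returns a-b in bytes; the result is negative when a is behind b.
func lsnDelta(a, b uint64) int64 {
	if a >= b {
		return int64(a - b)
	}
	return -int64(b - a)
}

// isWalSenderIdle reports whether the sampled wait event says the WAL sender
// has nothing to send. Case-insensitive comparison is an assumption made to
// tolerate capitalization differences across Postgres versions.
func isWalSenderIdle(waitEventType, waitEvent string) bool {
	return strings.EqualFold(waitEventType, "Client") &&
		strings.EqualFold(waitEvent, "WalSenderWaitForWAL")
}

func main() {
	// Made-up sample values standing in for one polling result.
	current, _ := parseLSN("0/3000060") // current LSN on the server
	sent, _ := parseLSN("0/3000000")    // pg_stat_replication.sent_lsn
	commit, _ := parseLSN("0/2FFFF00")  // commit LSN seen by the receiver

	fmt.Println("current - sent (bytes):", lsnDelta(current, sent))
	fmt.Println("sent - commit (bytes):", lsnDelta(sent, commit))
	fmt.Println("idle:", isWalSenderIdle("Client", "WalSenderWaitForWAL"))
}
```

Since the wait event is only a point-in-time sample, a single idle reading says little on its own; the deltas are the more stable signal at a multi-minute polling interval.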