Kafka connector for [go-pq-cdc](https://github.com/Trendyol/go-pq-cdc).
- **Optimized for Speed and Efficiency**: Minimal resource consumption and faster processing, designed to handle high-throughput data replication.
- **Real-Time Data Streaming**: Streams data directly from PostgreSQL to Kafka, ensuring up-to-date synchronization across systems.
- **Initial Snapshot Support**: Capture existing data before starting CDC, ensuring downstream systems receive both historical and real-time data.
- **Automatic Failover**: In the event of a failure, `go-pq-cdc-kafka` can quickly recover and resume data replication.
- **Concurrency**: Built with Go's concurrency model (goroutines and channels), ensuring lightweight and highly performant parallel operations.
The `go-pq-cdc-kafka` ensures high availability with passive/active modes for PostgreSQL replication slots.
- **Passive Mode**: If the replication slot becomes inactive, it automatically captures the slot and resumes data streaming. Additionally, other deployments monitor the slot's status, ensuring redundancy and failover capabilities.
This architecture guarantees minimal downtime and continuous data synchronization, even in the event of failure. Additionally, Go's faster cold starts provide quicker recovery times compared to Debezium, further minimizing potential downtime.
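For context, the slot-activity monitoring that passive deployments rely on can be expressed as a plain query against PostgreSQL's standard `pg_replication_slots` catalog. A minimal sketch, assuming the `lib/pq` driver and a slot named `cdc_slot` (both placeholders; go-pq-cdc's internal polling logic may differ):

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // assumed PostgreSQL driver; any database/sql driver works
)

// slotIsActive reports whether the named replication slot currently has an
// active consumer, using the standard pg_replication_slots system view.
func slotIsActive(ctx context.Context, db *sql.DB, slotName string) (bool, error) {
	var active bool
	err := db.QueryRowContext(ctx,
		"SELECT active FROM pg_replication_slots WHERE slot_name = $1",
		slotName).Scan(&active)
	return active, err
}

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost:5432/db?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	active, err := slotIsActive(context.Background(), db, "cdc_slot")
	if err != nil {
		log.Fatal(err)
	}
	// If the slot is inactive, a passive instance can capture it and resume streaming.
	fmt.Println("slot active:", active)
}
```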
## 📸 NEW: Snapshot Feature
**Capture existing data before starting CDC!** The snapshot feature enables initial data synchronization, ensuring downstream systems (Kafka) receive both historical and real-time data.
✨ **Key Highlights:**
- **Zero Data Loss**: Consistent point-in-time snapshot using PostgreSQL's `pg_export_snapshot()` (see the sketch after this list)
- **Chunk-Based Processing**: Memory-efficient processing of large tables
- **Multi-Instance Support**: Parallel processing across multiple instances for faster snapshots
- **Crash Recovery**: Automatic resume from failures with chunk-level tracking
- **No Duplicates**: Seamless transition from snapshot to CDC mode
- **Flexible Modes**: Choose between `initial`, `never`, or `snapshot_only` based on your needs
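To illustrate the point-in-time guarantee, the sketch below shows the standard PostgreSQL snapshot-export mechanism the feature builds on. This is plain `database/sql` code against stock PostgreSQL, not go-pq-cdc-kafka's internal implementation; the connection string and table name are placeholders:

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // assumed PostgreSQL driver
)

func main() {
	ctx := context.Background()
	db, err := sql.Open("postgres", "postgres://user:pass@localhost:5432/db?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A repeatable-read transaction pins one consistent view of the database.
	// It must stay open for the exported snapshot to remain usable.
	tx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelRepeatableRead})
	if err != nil {
		log.Fatal(err)
	}
	defer tx.Rollback()

	var snapshotID string
	if err := tx.QueryRowContext(ctx, "SELECT pg_export_snapshot()").Scan(&snapshotID); err != nil {
		log.Fatal(err)
	}
	fmt.Println("exported snapshot:", snapshotID)

	// Other workers attach to the exact same view before reading their chunks:
	//   BEGIN ISOLATION LEVEL REPEATABLE READ;
	//   SET TRANSACTION SNAPSHOT '<snapshotID>';
	//   SELECT * FROM some_table WHERE id BETWEEN $1 AND $2; -- one chunk
	// Rows changed after the export are picked up later by CDC, which is how
	// the snapshot-to-CDC handoff avoids both gaps and duplicates.
}
```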
### Configuration

| Variable | Type | Required | Default | Description | Notes |
|----------|------|----------|---------|-------------|-------|
| `cdc.slot.createIfNotExists` | bool | no | - | Create the replication slot if it does not exist; otherwise a `replication slot is not exists` error is returned. | |
| `cdc.slot.name` | string | yes | - | Set the logical replication slot name. | Should be unique and descriptive. |
| `cdc.slot.slotActivityCheckerInterval` | int | yes | 1000 | Set the slot activity check interval, in milliseconds. | Specify as an integer value in milliseconds (e.g., `1000` for 1 second). |
| `cdc.snapshot.enabled` | bool | no | false | Enable the initial snapshot feature. | When enabled, captures existing data before starting CDC. |
| `cdc.snapshot.mode` | string | no | never | Snapshot mode: `initial`, `never`, or `snapshot_only`. | **initial:** Take a snapshot only if no previous snapshot exists, then start CDC. <br> **never:** Skip the snapshot and start CDC immediately. <br> **snapshot_only:** Take a snapshot and exit (no CDC). |
| `cdc.snapshot.chunkSize` | int64 | no | 8000 | Number of rows per chunk during the snapshot. | Adjust based on table size. Larger chunks mean fewer chunks, but more memory per chunk. |
| `cdc.snapshot.claimTimeout` | time.Duration | no | 30s | Timeout after which stale chunks are reclaimed. | If a worker sends no heartbeat for this duration, its chunk is reclaimed by another worker. |
| `cdc.snapshot.heartbeatInterval` | time.Duration | no | 5s | Interval between worker heartbeat updates. | Workers send a heartbeat at this interval to indicate they are still processing a chunk. |
| `cdc.snapshot.instanceId` | string | no | auto | Custom instance identifier (optional). | Auto-generated as `hostname-pid` if not specified. Useful for tracking workers in multi-instance scenarios. |
| `cdc.snapshot.tables` | []Table | no* | - | Tables to snapshot (required for `snapshot_only` mode, optional for `initial` mode). | **snapshot_only:** Must be specified here (independent from the publication). <br> **initial:** If specified, must be a subset of the publication tables; if not specified, all publication tables are snapshotted. |
| `kafka.tableTopicMapping` | map[string]string | yes | - | Mapping of PostgreSQL table events to Kafka topics. | Maps table names to Kafka topics. |
| `kafka.brokers` | []string | yes | - | Broker IP and port information. | |
| `kafka.producerBatchSize` | integer | no | 2000 | Maximum message count per batch; when exceeded, a flush is triggered. | |
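As a rough illustration of how these options group together, here is a sketch in Go. The struct and field names below are derived from the configuration keys above and are illustrative only, not go-pq-cdc-kafka's actual exported types:

```go
// Illustrative only: fields mirror the cdc.snapshot.* and kafka.* keys
// documented above, not necessarily the library's exact config structs.
package config

import "time"

// Table identifies a table to snapshot; placeholder for the real type.
type Table struct {
	Name string
}

type SnapshotConfig struct {
	Enabled           bool          // cdc.snapshot.enabled (default false)
	Mode              string        // "initial", "never", or "snapshot_only"
	ChunkSize         int64         // rows per chunk (default 8000)
	ClaimTimeout      time.Duration // reclaim stale chunks after this (default 30s)
	HeartbeatInterval time.Duration // worker liveness signal period (default 5s)
	InstanceID        string        // optional; auto-generated as hostname-pid
	Tables            []Table       // required for snapshot_only, optional for initial
}

type KafkaConfig struct {
	TableTopicMapping map[string]string // PostgreSQL table -> Kafka topic
	Brokers           []string          // broker host:port list
	ProducerBatchSize int               // flush threshold (default 2000)
}
```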
## Exposed Metrics

The following metrics are exposed at the `/metrics` endpoint.

| Metric Name | Description | Labels | Value Type |
|-------------|-------------|--------|------------|
| go_pq_cdc_kafka_write_total | Total number of successful write operations to Kafka. | slot_name, host, topic_name | Counter |
| go_pq_cdc_kafka_err_total | Total number of failed write operations to Kafka. | slot_name, host, topic_name | Counter |
### Snapshot Metrics
| Metric Name | Description | Labels | Value Type |
|-------------|-------------|--------|------------|