Search code examples
apache-kafkaflushdebeziumconnectorwal

Debezium for Postgress and WAL flush


I have Debezium in a container, capturing all changes of PostgeSQL database records. But i am unable to understand couple of things regarding how Debezium works. If Debezium starts for the first time it takes a snapshot of the database and then starting streaming based on WAL file accordantly to the Debezium documentation.:

PostgreSQL normally purges write-ahead log (WAL) segments after some period of time. This means that the connector does not have the complete history of all changes that have been made to the database. Therefore, when the PostgreSQL connector first connects to a particular PostgreSQL database, it starts by performing a consistent snapshot of each of the database schemas. After the connector completes the snapshot, it continues streaming changes from the exact point at which the snapshot was made. This way, the connector starts with a consistent view of all of the data, and does not omit any changes that were made while the snapshot was being taken. The connector is tolerant of failures. As the connector reads changes and produces events, it records the WAL position for each event. If the connector stops for any reason (including communication failures, network problems, or crashes), upon restart the connector continues reading the WAL where it last left off. This includes snapshots. If the connector stops during a snapshot, the connector begins a new snapshot when it restarts.

But there is gap here, or maybe not. When the snapshot of the database is complete and then it is streaming from WAL file, if the connector goes down and until it goes up the WAL is purged/flushed, how Debezium ensuring data integrity?


Solution

  • Since Debezium is using a replication slot on PG, Debezium constantly sends back to PG the information up to which LSN it has consumed so that PG can flush the WAL files. Now, this information is stored in the table pg_replication_slots. When Debezium starts up again, it reads the restart_lsn from that table and requests changes that happened before that value. This is how Debezium is ensuring data integrity. Not that if for some reason, that LSN is not available in the WAL files, there is no way to get it back, meaning data loss has happened.