apache-kafka debezium cdc

Debezium signal table at connector startup


I have a question about ad hoc snapshot signals in Debezium.

Let's say that I have created the signal table in my database before creating my CDC connector:

-- Creating the signal table
CREATE TABLE debezium_signal (id VARCHAR(42) PRIMARY KEY, type VARCHAR(32) NOT NULL, data VARCHAR(2048) NULL);

-- Starting a snapshot
INSERT INTO debezium_signal (id, type, data)
VALUES ('d139b9b7-7777-4547-917d-e1775ea61d41', 'execute-snapshot', '{"data-collections": ["servicedb.customers"]}');

Now, with this table created and populated, I create my CDC connector with this config:

{
    "name": "my_awesome_cdc_connector",
    "database.dbname": "servicedb",
    "table.include.list": "customers",
    "topic.prefix": "servicedb",
    "snapshot.mode": "schema_only",
    "snapshot.locking.mode": "none",
    "signal.enabled.channels": "source",
    "signal.data.collection": "servicedb.debezium_signal",
    "incremental.snapshot.allow.schema.changes": "true",
    "incremental.snapshot.chunk.size": 1024,
    "tombstones.on.delete": "true",
    "database.ssl.mode": "preferred",
    "poll.interval.ms": "1000",
    "max.batch.size": "1000",
    "output.data.format": "AVRO",
    "tasks.max": "1",
    "status": "RUNNING"
}

My question is:

As I already have an entry in my debezium_signal table, when the CDC connector starts running, will it start my incremental snapshot?

Or will it only consider the signals sent after the connector creation?


Solution

  • As I already have an entry in my debezium_signal table, when the CDC connector starts running, will it start my incremental snapshot?
    
    Or will it only consider the signals sent after the connector creation?
    

    CDC is not triggered for rows that already exist, only for rows inserted into the table after you have created the CDC connector and started it. So the signal that is already in debezium_signal will not start your incremental snapshot.
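    Following from the point above, a minimal sketch of sending the signal again as a new row once the connector is running. The id below is a made-up placeholder (any unique string should do), and the data payload is copied from the question:

    ```sql
    -- Run this only after the connector has been created and is RUNNING,
    -- so that the insert is written to the log while the connector is streaming.
    -- The id is a hypothetical placeholder; use any value not already in the table.
    INSERT INTO debezium_signal (id, type, data)
    VALUES (
        '3f2b6c1e-0000-4000-8000-000000000001',
        'execute-snapshot',
        '{"data-collections": ["servicedb.customers"]}'
    );
    ```

    Using a new id rather than reusing the old row keeps the original signal as a record of what was attempted before the connector existed.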

    However, you can still re-trigger CDC for a row that already existed in the table by updating a column to the value it already has.

    For example:

    UPDATE debezium_signal SET type = 'execute-snapshot' WHERE id = 'd139b9b7-7777-4547-917d-e1775ea61d41';
    

    This does not change the existing data, but it writes a new entry to the database's WAL, which the CDC connector reads and uses to re-trigger. It relies on the database logging an update that changes nothing, which PostgreSQL does; verify this for your own database.

    That said, I would strongly recommend adding a timestamp column to the table that has CDC configured, and moving the data column to a child table. The child table has an id column and the data column, where id equals the id in the parent (debezium_signal) table.

    In production we have seen issues where some CDC events were not triggered when the update/insert rate was high, caused by large values like those in your data column.

    You can always query the data column from the child table when the CDC event fires.
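    A minimal sketch of the parent/child layout described above. It is shown on a hypothetical captured table, orders, rather than on debezium_signal itself, because Debezium's documentation describes the signaling table as having the three fields id, type and data, so check your connector version before changing that table. All table and column names here are assumptions:

    ```sql
    -- Hypothetical parent table: small rows, captured by CDC.
    CREATE TABLE orders (
        id         VARCHAR(42) PRIMARY KEY,
        type       VARCHAR(32) NOT NULL,
        updated_at TIMESTAMP   NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    -- Hypothetical child table: holds the large column, keyed by the parent id.
    CREATE TABLE orders_data (
        id   VARCHAR(42) PRIMARY KEY,
        data VARCHAR(2048) NULL,
        FOREIGN KEY (id) REFERENCES orders (id)
    );

    -- When a CDC event for orders arrives, look up the large column by id.
    SELECT d.data
    FROM orders_data AS d
    WHERE d.id = 'd139b9b7-7777-4547-917d-e1775ea61d41';
    ```

    With this split, change events on the parent stay small, and the consumer pays for the large column only when it actually needs it.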