Search code examples
apache-kafkaapache-kafka-connectdebezium

Debezium New Record State Extraction SMT doesn't work properly in case of DELETE


I'm trying to apply Debezium's New Record State Extraction SMT using the following configuration:

"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": true,
"transforms.unwrap.delete.handling.mode": "rewrite",
"transforms.unwrap.add.fields": "db,schema,table,txId,ts_ms"

For INSERT and UPDATE operations I get the messages as expected, but in case of DELETE I get the following as a payload:

"payload": {
    "id": 2,
    "first_name": "",
    "last_name": "",
    "__db": "postgres",
    "__schema": "schema1",
    "__table": "user_details",
    "__txId": 5145,
    "__ts_ms": 1638760801510,
    "__deleted": "true"
  }

As you can see above, both first_name and last_name fields have empty values, though the record I deleted has non-empty values for both of those fields. What I expect to see as a value for those 2 fields is their value at the moment of deletion as it is shown in debezium's before payload chunk in case when New Record State Extraction SMT is not applied.


Solution

  • The reason of empty values for all columns except the PK is not related to New Record State Extraction SMT at all. For postgres, there is a REPLICA IDENTITY table-level parameter that can be used to control the information written to WAL to identify tuple data that is being deleted or updated.

    This parameter has 4 modes:

    • DEFAULT
    • USING INDEX index
    • FULL
    • NOTHING

    In the case of DEFAULT, old tuple data is only identified with the primary key of the table. Columns that are not part of the primary key do not have their old value written.

    In the case of FULL, all the column values of old tuple are properly written to WAL all the time. Hence, executing the following command for the target table will make the old record values to be properly populated in debezium message:

    ALTER TABLE some_table REPLICA IDENTITY FULL;
    

    NOTE!! FULL is the most verbose, and as well the most resource-consuming mode. Be careful with it particularly for heavily-updated tables.