Tags: json, postgresql, apache-kafka, storage, debezium

Debezium turns a 1 GB database into 100 GB of topic storage


I run Debezium in a container, capturing all changes to the records of a PostgreSQL database.

The PostgreSQL database is around 1 GB and has about 1,000 tables. Debezium is configured to capture changes on all tables, and its Kafka topic storage is around 100 GB after the initial load.

I understand that converting to JSON adds overhead, but the result is roughly 100 times larger than the source database.

Is there anything that can be configured to reduce Kafka topic storage?


Solution

  • You can use a single message transformation (SMT) to reduce the size of topic messages by adding the SMT configuration to your connector’s configuration:

    transforms=unwrap,...
    transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
    

    See the documentation:

    A Debezium data change event has a complex structure that provides a wealth of information. Kafka records that convey Debezium change events contain all of this information. However, parts of a Kafka ecosystem might expect Kafka records that provide a flat structure of field names and values. To provide this kind of record, Debezium provides the event flattening single message transformation (SMT). Configure this transformation when consumers need Kafka records that have a format that is simpler than Kafka records that contain Debezium change events.

    Kafka also supports compression at the topic level, so you can set the default topic compression in the connector's configuration as part of the default topic creation group.

    See the documentation:

    topic.creation.default.compression.type is mapped to the compression.type property of the topic level configuration parameters and defines how messages are compressed on hard disk.
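
    As a minimal sketch, the compression setting could sit alongside the rest of the default topic creation group in the connector configuration. The values below (lz4 as the codec, and -1 to inherit the broker defaults, which assumes Kafka 2.4 or later) are illustrative assumptions, not recommendations from the documentation:

    # The default group requires replication.factor and partitions;
    # -1 means "use the broker default".
    topic.creation.default.replication.factor=-1
    topic.creation.default.partitions=-1
    # Assumed codec; Kafka also accepts gzip, snappy, zstd, uncompressed and producer.
    topic.creation.default.compression.type=lz4

    These properties apply only to topics that Kafka Connect creates itself, so topics that already exist keep their current settings.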