postgresql apache-kafka apache-kafka-connect debezium

Postgres replication slot from kafka-connect is filling up

The replication slot created by a kafka-connector connector is filling up.

I have a postgres RDS database on AWS. I put the following parameter group option on it (only showing the diff from default)

rds.logical_replication: 1

I have kafka connect running with a debezium postgres connector. This is the config (with certain values redacted, of course)

"database.dbname"        = "mydb"
"database.hostname"      = "myhostname"
"database.password"      = "mypass"
"database.port"          = "myport"
"database.server.name"   = "postgres"
"database.user"          = "myuser"
"database.whitelist"     = "my_database"
"include.schema.changes" = "false"
"plugin.name"            = "wal2json_streaming"
"slot.name"              = "my_slotname"
"snapshot.mode"          = "never"
"table.whitelist"        = "public.mytable"
"tombstones.on.delete"   = "false"
"transforms"             = "key"
"transforms.key.field"   = "id"
"transforms.key.type"    = "org.apache.kafka.connect.transforms.ExtractField$Key"

If I get the status of this connector, it appears to be fine.

curl -s http://my.kafkaconnect.url:kc_port/connectors/my-connector/status | jq

{
  "name": "my-connector",
  "connector": {
    "state": "RUNNING",
    "worker_id": "some_ip"
  },
  "tasks": [
    {
      "id": 0,
      "state": "RUNNING",
      "worker_id": "some_ip"
    }
  ],
  "type": "source"
}

However, the replication slot in postgres keeps getting larger and larger:

SELECT slot_name,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as replicationSlotLag,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as confirmedLag,
  active
FROM pg_replication_slots;
           slot_name           | replicationslotlag | confirmedlag | active
-------------------------------+--------------------+--------------+--------
 my_slotname                   | 20 GB              | 20 GB        | t

Why does replication keep growing? As I understand, the kafka connect connector task that is running should be reading from this replication slot, publishing it to the topic postgres. public.mytable, and then the replication slot should decrease in size. Am I missing something in this chain of actions?

Solution

Please take a look at WAL Diskspace Consumption.

The most common reason why the PostgreSQL WAL backlogs is because the connector is monitoring a database or a subset of tables from your database change much more infrequent compared to the other tables or databases in your environment and therefore the connector isn't acknowledging the LSNs frequent enough to avoid the WAL backlog.

For Debezium 1.0.x and before, enable heartbeat.interval.ms.
For Debezium 1.1.0 and after, also consider enabling heartbeat.action.query as well.