Search code examples
postgresqlreplicationdatabase-replication

Postgres logical replication stuck at LSN value. I want to advance past it with pg_replication_origin_advance


  • Postgres 13 on AWS RDS
  • Replication slot is growing in size
  • CPU and I/O are low on both master and slave
  • The subscriber seems to be stuck at 33533/7D2841D8 even though the received LSN keeps advancing. There is a logical worker that runs on the subscriber because I see it in pg_stat_activity however I'm not sure what it's doing.

I want to pull the plug on whatever this transaction is since it's been taking days. I found the pg_replication_origin_advance function which will allow me to do that, however I don't know what the "next" LSN value. Is there a way I can look that up?


Solution

  • I figured out what my problem was: triggers on the subscriber.

    I saw in pg_locks that a table was constantly being hit. There was an exclusive lock on the table, and no other query wanted to use it which was good. It was just busy all the time.

    Somehow I found pg_stat_user_functions which says how much time a function (like a trigger function) executes for. There were a pair of functions that had an extremely large amount of time running and it kept rising.

    I did an alter table foo disable trigger x, and once that was done, replication immediately started flowing. My data was now out-of-date because the trigger wasn't running, but that's fine as long as my publisher doesn't crash due to running out of space. After replication was caught up, I re-enabled the triggers, and things have been running okay.