Tags: postgresql, amazon-web-services, amazon-rds, dms, wal

AWS DMS task failing after some time in CDC mode


I'm having trouble setting up a task that migrates data from an RDS database (PostgreSQL, engine 10.15) into an S3 bucket in initial migration + CDC mode. Both endpoints are configured and tested successfully. I have created the task twice, and both times it ran for a couple of hours at most. The first time, the initial dump went fine and some of the incremental dumps took place as well; the second time, only the initial dump finished, and no incremental dump was performed before the task failed.

The error message is now:

Last Error Task 'data-migration-bp-dev' was suspended after 9 successive recovery failures Stop Reason FATAL_ERROR Error Level FATAL_

but just after it failed for the first time it was:

Last Error An internal WAL conversational protocol error has occurred. Task error notification received from subtask 0, thread 0 reptask/replicationtask.c:2859 1020452 Error executing source loop; Stream component failed at subtask 0, component st_0_data-migration-rds-bp-dev; Stream component 'st_0_data-migration-rds-bp-dev' terminated reptask/replicationtask.c:2866 1020452 Stop Reason RECOVERABLE_ERROR Error Level RECOVERABLE

In the CloudWatch logs I see the following error messages:

SOURCE_CAPTURE I: Streaming initiated successfully (postgres_pglogical.c:274)
SOURCE_CAPTURE I: #1 : Non-monotonic LSN sequence: Current LSN '00000000/00000000' < Previous LSN '000001E3/94016430'. Event is ignored. (postgres_endpoint_wal_engine.c:710)
SOURCE_CAPTURE I: Unable to resolve attributes for relation id '28804'. Aborting action. (postgres_pglogical.c:1643)
SOURCE_CAPTURE I: End of CDC / CAPTURE events for POSTGRES endpoint. (postgres_endpoint_capture.c:520)
SOURCE_CAPTURE I: CAPTURE ended with exceptions. (postgres_endpoint_capture.c:527)
SOURCE_CAPTURE E: Could not find relation id '28804' in hash. 1020483 (postgres_pglogical.c:1470)
SOURCE_CAPTURE E: Failed to parse relation from dml command 1020483 (postgres_pglogical.c:2515)
SOURCE_CAPTURE E: Failed to find relation id on target while processing message from source 1020452 (postgres_endpoint_wal_engine.c:805)
SOURCE_CAPTURE E: WAL stream loop ended abnormally. (STATUS_PROTOCOL_ERROR) 1020452 (postgres_endpoint_wal_engine.c:992)
SOURCE_CAPTURE E: WAL reader terminated with irrecoverable error. 1020452 (postgres_endpoint_capture.c:496)
TASK_MANAGER I: Task - data-migration-bp-dev is in ERROR state, updating starting status to AR_NOT_APPLICABLE (repository.c:5102)
SOURCE_CAPTURE E: Error executing source loop 1020452 (streamcomponent.c:1870)
TASK_MANAGER E: Stream component failed at subtask 0, component st_0_data-migration-rds-bp-dev 1020452 (subtask.c:1409)
SOURCE_CAPTURE E: Stream component 'st_0_data-migration-rds-bp-dev' terminated 1020452 (subtask.c:1578)
TASK_MANAGER E: Task error notification received from subtask 0, thread 0 1020452 (replicationtask.c:2859)
TASK_MANAGER E: Error executing source loop; Stream component failed at subtask 0, component st_0_data-migration-rds-bp-dev; Stream component 'st_0_data-migration-rds-bp-dev' terminated 1020452 (replicationtask.c:2866)
TASK_MANAGER E: Task 'data-migration-bp-dev' encountered a recoverable error, retry attempt # 0 (repository.c:5184)
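
For triage, it can help to pull the first `E:` line out of an excerpt like the one above and look at which source file it names. The following is only a minimal sketch: it assumes each line has exactly the shape quoted here (`COMPONENT LEVEL: message (file.c:line)`), whereas raw CloudWatch output usually carries extra prefixes such as timestamps that would need stripping first. The helper names `parse_line` and `first_error` are hypothetical.

```python
import re

# Assumed line shape, taken from the excerpt above:
#   COMPONENT LEVEL: message (source_file.c:line)
LOG_LINE = re.compile(
    r"^(?P<component>[A-Z_]+) (?P<level>[A-Z]): (?P<message>.*) "
    r"\((?P<source>[\w.]+):(?P<line>\d+)\)$"
)


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["line"] = int(record["line"])
    return record


def first_error(lines):
    """Return the first record whose level is 'E', or None if there is none."""
    for line in lines:
        record = parse_line(line)
        if record is not None and record["level"] == "E":
            return record
    return None


# Three lines copied from the excerpt above.
SAMPLE = [
    "SOURCE_CAPTURE I: Streaming initiated successfully (postgres_pglogical.c:274)",
    "SOURCE_CAPTURE E: Could not find relation id '28804' in hash. 1020483 (postgres_pglogical.c:1470)",
    "SOURCE_CAPTURE E: WAL stream loop ended abnormally. (STATUS_PROTOCOL_ERROR) 1020452 (postgres_endpoint_wal_engine.c:992)",
]

if __name__ == "__main__":
    error = first_error(SAMPLE)
    print(error["source"], error["line"], error["message"])
```

On these sample lines, the first error record names `postgres_pglogical.c`, which is at least consistent with the suspicion that the pglogical code path is involved.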

I should mention that we had to configure the pglogical plugin and restart the database. We got the following error at the end of that process, which we ignored because the DMS task started afterwards:

ERROR: current database is not configured as pglogical node
HINT: create pglogical node first

Is the failure of our DMS task related to the pglogical plugin configuration? If so, how should we configure it so that it works (our DB engine should be compatible with it, shouldn't it)? And if not, how can we fix it?

Thank you in advance!


Solution

  • Should anyone get the same error in the future, here is what we were told by the AWS tech specialist:

    There is a known (to AWS) issue with the pglogical plugin. The solution requires using the test_decoding plugin instead.

    1. Force the DMS endpoint to use the test_decoding plugin by specifying pluginName=test_decoding in its Extra Connection Attributes
    2. Create a new DMS task using this endpoint (reusing the old task may cause it to fail because the task and the logs are out of sync)

    This did resolve the issue, but we still don't know what the actual problem was with the pglogical plugin, which is (at the moment) strongly recommended throughout the DMS documentation.
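
Step 1 above can also be scripted rather than done in the console. The sketch below is not the AWS specialist's procedure, just one way to do it under a few assumptions: extra connection attributes are a semicolon-separated list of `key=value` pairs, the endpoint is updated through boto3's `modify_endpoint` call, and any existing plugin setting should be replaced whatever its capitalization (a defensive choice, not something the source confirms). The helper names, the endpoint ARN and `otherAttr` are placeholders.

```python
def with_attribute(existing, key, value):
    """Return a semicolon-separated attribute string with `key` set to `value`.

    An existing entry with the same key is dropped first; the comparison is
    case-insensitive, which is an assumption made here for safety.
    """
    pairs = []
    for part in (existing or "").split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, val = part.partition("=")
        if name.strip().lower() != key.lower():
            pairs.append((name.strip(), val.strip()))
    pairs.append((key, value))
    return ";".join(f"{name}={val}" for name, val in pairs)


def apply_to_endpoint(endpoint_arn, attributes):
    """Write the attribute string to a DMS endpoint (needs AWS credentials)."""
    import boto3  # third-party; imported lazily so the helper above works without it

    dms = boto3.client("dms")
    return dms.modify_endpoint(
        EndpointArn=endpoint_arn,
        ExtraConnectionAttributes=attributes,
    )


if __name__ == "__main__":
    # "otherAttr=1" stands in for whatever attributes the endpoint already has.
    print(with_attribute("otherAttr=1", "pluginName", "test_decoding"))
    # apply_to_endpoint("<source-endpoint-arn>", "pluginName=test_decoding")
```

As step 2 says, the endpoint change alone is not enough: a new task still has to be created against the updated endpoint.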