Tags: postgresql, docker, containers, docker-swarm

Using Docker, what triggered PANIC: could not locate a valid checkpoint record


I am trying to understand Docker a little better, and in doing so it appears I have corrupted my application's PostgreSQL database.

I am using Docker Swarm to start my application, and I'm getting the following error in a loop in the PostgreSQL container:

    2021-02-10 15:38:51.304 UTC 120 LOG:  database system was shut down at 2021-02-10 14:49:14 UTC
    2021-02-10 15:38:51.304 UTC 120 LOG:  invalid primary checkpoint record
    2021-02-10 15:38:51.304 UTC 120 LOG:  invalid secondary checkpoint record
    2021-02-10 15:38:51.304 UTC 120 PANIC:  could not locate a valid checkpoint record
    2021-02-10 15:38:51.447 UTC 1 LOG:  startup process (PID 120) was terminated by signal 6
    2021-02-10 15:38:51.447 UTC 1 LOG:  aborting startup due to startup process failure
    2021-02-10 15:38:51.455 UTC 1 LOG:  database system is shut down

Initially, I was trying to modify the pg_hba.conf file in the container by editing it directly through the volume's mount point on the host filesystem, which is at

 /var/lib/docker/volumes/postgres96-data-volume/_data

However, every time I restarted the container my changes to pg_hba.conf were reverted. So this morning I added a dummy file called test to the mount folder and restarted the container, expecting the file to be deleted; that would have given me visual confirmation that restarting the container resets everything in that mount to its original state. After restarting it again, that's when I started getting the error messages above, which prevent my application from starting.
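Aside on the pg_hba.conf part: with the official postgres image, a more durable way to customize pg_hba.conf is to keep an edited copy on the host, bind-mount it into the container, and point the server at it with the hba_file parameter, instead of editing the file inside the data volume. A minimal sketch, assuming the postgres:9.6 image and a hypothetical container name my_postgres:

    $ # Copy the current pg_hba.conf out of the running container (container name is hypothetical)
    $ docker cp my_postgres:/var/lib/postgresql/data/pg_hba.conf ./pg_hba.conf
    $ # ...edit ./pg_hba.conf on the host...
    $ # Recreate the container with the edited file bind-mounted read-only,
    $ # and tell postgres to load it via the hba_file server parameter
    $ docker run -d --name my_postgres \
        -v postgres96-data-volume:/var/lib/postgresql/data \
        -v "$PWD/pg_hba.conf":/etc/postgresql/pg_hba.conf:ro \
        postgres:9.6 -c hba_file=/etc/postgresql/pg_hba.conf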

I deleted the test file and restarted the container again, but the error message continues.

I have read many solutions for how to fix it, but my question is more about understanding why adding a file would cause this. Is my volume corrupted simply because I added a file to it?

Thanks


Solution

  • This error means the Postgres volume is corrupted. This can happen when two containers try to connect to the same volume at the same time (see this answer for slightly more info); I'm not sure how merely modifying a file would corrupt it. You'll need to delete and recreate the volume, though. To do this you can run the following (a sketch for checking what is attached to the volume, and for recreating it, follows these steps):

    $ docker stop <your_container_name>  # stop the running Postgres container
    $ docker image prune                 # remove dangling images (add -a for all images not used by a container)
    $ docker volume ls                   # list volumes to find the corrupted one
    $ docker volume rm <volume_name>     # remove the corrupted volume
    

    I had to run the commands above to stop the container, clean up images that somehow weren't attached to any container, and finally delete the offending volume where the corrupted data was held.
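    Before removing the volume, it can help to confirm that nothing is still attached to it, and after removing it you can recreate it empty so Postgres re-runs initdb on the next start. A minimal sketch using the volume name from the question; note that everything in the volume is lost, so restore from a backup if the data matters:

    $ # List containers (running or stopped) still referencing the volume;
    $ # two containers using it at once is the classic corruption scenario
    $ docker ps -a --filter volume=postgres96-data-volume
    $ # After docker volume rm, recreate the volume empty for a fresh initdb
    $ docker volume create postgres96-data-volume

    In Swarm specifically, two tasks can briefly share the volume during a service update when the update order is start-first; keeping the default stop-first order and a single replica reduces that risk.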