Tags: postgresql, docker, docker-swarm, docker-network, esxi

Postgresql Docker container not receiving TCP requests in swarm mode


I am not quite sure where my problem is. I can only describe some symptoms, so please bear with me if logs or configuration details are missing.

I want to set up a highly available (HA) PostgreSQL database. The easiest way seemed to be a preconfigured Docker image, so I am using the Bitnami PostgreSQL image with the following configuration in swarm mode on two separate nodes:

version: '3.8'

services:
  postgresql-master:
    image: 'docker.io/bitnami/postgresql:15'
    ports:
      - '5432:5432'
    networks:
      - postgres_network
    volumes:
      - '/localVol:/bitnami/postgresql'
    environment:
      - POSTGRESQL_REPLICATION_MODE=master
      - POSTGRESQL_REPLICATION_USER=repmgr_username
      - POSTGRESQL_REPLICATION_PASSWORD=repmgr_password
      - POSTGRESQL_USERNAME=username
      - POSTGRESQL_PASSWORD=password
      - POSTGRESQL_DATABASE=dbname
      - POSTGRESQL_SYNCHRONOUS_COMMIT_MODE=on
      - POSTGRESQL_NUM_SYNCHRONOUS_REPLICAS=1

    deploy:
      placement:
        constraints:
          - node.labels.type == primary

  postgresql-slave:
    image: 'docker.io/bitnami/postgresql:15'
    ports:
      - '5432'
    networks:
      - postgres_network
    depends_on:
      - postgresql-master
    environment:
      - POSTGRESQL_USERNAME=username
      - POSTGRESQL_PASSWORD=password
      - POSTGRESQL_REPLICATION_MODE=slave
      - POSTGRESQL_REPLICATION_USER=repmgr_username
      - POSTGRESQL_REPLICATION_PASSWORD=repmgr_password
      - POSTGRESQL_MASTER_HOST=postgresql-master
      - POSTGRESQL_MASTER_PORT_NUMBER=5432
    volumes:
      - '/localVol:/bitnami/postgresql'
    deploy:
      placement:
        constraints:
          - node.labels.type != primary
networks:
  postgres_network:
    driver: overlay
    external: false
    internal: true
    ipam:
      config:
        - subnet: 10.70.1.0/24

The swarm is created with a plain docker swarm init, and the second node joins with docker swarm join. No extra configuration.

When I run this file with docker compose up on a single host (without the deploy constraints), both containers start and the database is replicated. Everything works as desired.

When I deploy the same file unchanged with docker stack up, the primary starts and stays stable, but the secondary does not. Logs:

Primary

postgresql 14:07:57.00 INFO  ==> ** Starting PostgreSQL setup **
postgresql 14:07:57.06 INFO  ==> Validating settings in POSTGRESQL_* env vars..
postgresql 14:07:57.07 INFO  ==> Loading custom pre-init scripts...
postgresql 14:07:57.09 INFO  ==> Initializing PostgreSQL database...
postgresql 14:07:57.10 INFO  ==> Custom configuration /opt/bitnami/postgresql/conf/pg_hba.conf detected
postgresql 14:07:57.17 INFO  ==> Deploying PostgreSQL with persisted data...
postgresql 14:07:57.21 INFO  ==> Configuring replication parameters
postgresql 14:07:57.28 INFO  ==> Configuring fsync
postgresql 14:07:57.31 INFO  ==> Loading custom scripts...
postgresql 14:07:57.31 INFO  ==> Enabling remote connections
postgresql 14:07:57.33 INFO  ==> ** PostgreSQL setup finished! **
postgresql 14:07:57.34 INFO  ==> ** Starting PostgreSQL **
2022-11-16 14:07:57.363 GMT [1] LOG:  pgaudit extension initialized
2022-11-16 14:07:57.374 GMT [1] LOG:  starting PostgreSQL 15.0 on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2022-11-16 14:07:57.377 GMT [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2022-11-16 14:07:57.378 GMT [1] LOG:  could not create IPv6 socket for address "::": Address family not supported by protocol
2022-11-16 14:07:57.380 GMT [1] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2022-11-16 14:07:57.384 GMT [83] LOG:  database system was shut down at 2022-11-16 14:07:36 GMT
2022-11-16 14:07:57.392 GMT [1] LOG:  database system is ready to accept connections

Secondary

postgresql 14:40:51.58 INFO  ==> ** Starting PostgreSQL setup **
postgresql 14:40:51.64 INFO  ==> Validating settings in POSTGRESQL_* env vars..
postgresql 14:40:51.65 INFO  ==> Loading custom pre-init scripts...
postgresql 14:40:51.66 INFO  ==> Initializing PostgreSQL database...
postgresql 14:40:51.70 INFO  ==> pg_hba.conf file not detected. Generating it...
postgresql 14:40:51.70 INFO  ==> Generating local authentication configuration
postgresql 14:40:51.74 INFO  ==> Waiting for replication master to accept connections (60 timeout)...
postgresql-master:5432 - no response

After logging "no response" over and over, the secondary eventually restarts itself.
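The secondary's log suggests it polls the master's port until its timeout runs out. Below is a minimal bash sketch of such a wait loop, useful for reproducing the check by hand from a container on postgres_network. It assumes bash and GNU coreutils' timeout are installed; the function name wait_for_port is hypothetical and this is not the image's actual implementation. The per-attempt timeout matters here: if packets are silently dropped rather than refused, a plain connect attempt can hang for a long time instead of failing.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not Bitnami's script): wait until HOST:PORT accepts
# a TCP connection, printing one line per failed attempt.
#
# Usage: wait_for_port HOST PORT MAX_ATTEMPTS
# Returns 0 on the first successful connection, 1 after MAX_ATTEMPTS failures.
wait_for_port() {
  local host="$1" port="$2" max_attempts="$3"
  local attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    # bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT.
    # "timeout 1" caps each attempt at one second, so dropped SYN packets
    # show up as a failed attempt instead of a long hang.
    if timeout 1 bash -c 'echo > "/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
      echo "${host}:${port} - accepting connections"
      return 0
    fi
    echo "${host}:${port} - no response"
    sleep 1
    attempt=$((attempt + 1))
  done
  return 1
}
```

For example, wait_for_port postgresql-master 5432 60 run inside a container on the overlay network mirrors what the secondary is waiting for.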

I have tried pinging the containers, which works. When I publish the primary's port on the host, I can also access the database from the host. However, I cannot get any TCP traffic through to either container: I tested with netcat and tcpdump, and although netcat sends packets, tcpdump on the primary and the secondary shows no incoming requests.
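Capturing on the hosts rather than inside the containers can show where the traffic disappears, because overlay traffic between swarm nodes travels VXLAN-encapsulated over UDP, by default on port 4789. Swarm also needs TCP 2377 (cluster management) and TCP/UDP 7946 (node communication) open between the nodes. A sketch of that check follows; the interface name eth0 is an assumption, and netcat may not be installed in the image.

```shell
# On each Docker host (not inside a container): capture VXLAN-encapsulated
# overlay traffic. "eth0" is an assumed interface name; use the host's uplink.
sudo tcpdump -ni eth0 udp port 4789

# While both captures run, generate traffic from a container on the other
# node, for example (only if netcat exists in that container):
#   nc -vz postgresql-master 5432
```

If packets show up leaving one host but never arrive on the other, something between the hosts (hypervisor, virtual switch or firewall) is dropping them.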

Does anybody have a tip for me?


Solution

  • I just found the error.

    As someone explains in a blog post, port 4789 is blocked when the nodes are virtualised on an ESXi stack. That is the default UDP port for overlay network (VXLAN) data traffic.

    Changing that port when initialising the swarm solves the problem:

    docker swarm init --data-path-port 4788
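A fuller sketch of that fix for a two-node swarm like the one in the question. As far as I know, the data path port can only be chosen at docker swarm init time, so an existing swarm has to be recreated, and node labels set with docker node update are lost along with it. MANAGER_IP, PRIMARY_NODE, WORKER_TOKEN and the stack name pg are hypothetical placeholders; each step runs on the node named in its comment.

```shell
# Hypothetical placeholders; substitute real values.
MANAGER_IP=192.0.2.10      # address of the manager node
PRIMARY_NODE=node-1        # hostname of the node that should run postgresql-master

# 1. Every node: leave the old swarm (its services, networks and labels are lost).
docker swarm leave --force

# 2. Manager: create a new swarm that carries overlay (VXLAN) traffic on UDP 4788.
docker swarm init --advertise-addr "$MANAGER_IP" --data-path-port 4788

# 3. Manager: print only the worker join token.
docker swarm join-token -q worker

# 4. Worker: join using the token printed in step 3.
WORKER_TOKEN=paste-token-here   # hypothetical; paste the real token
docker swarm join --token "$WORKER_TOKEN" "$MANAGER_IP:2377"

# 5. Manager: restore the label used by the placement constraints, then redeploy.
docker node update --label-add type=primary "$PRIMARY_NODE"
docker stack deploy -c docker-compose.yml pg

# 6. Any node: recent Docker versions report the data path port in "docker info".
docker info | grep -i 'data path port'
```

If a host firewall is in use, UDP 4788 now has to be allowed between the nodes instead of 4789.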