Tags: postgresql, high-availability, pgpool

pgpool 4.1.0 healthcheck getsockopt() detected error "Connection refused"


I am trying to set up a pgpool load balancer for a PostgreSQL streaming replication cluster.

I am using postgresql-12 and pgpool2-4.1.0 from the PostgreSQL apt repo https://apt.postgresql.org/pub/repos/apt/ on Debian 10.2 (latest stable).

I have set up the PostgreSQL cluster with streaming replication using physical slots (not WAL shipping) and everything seems to be working properly. The secondaries connect and replicate data without any issues.

Then I installed pgpool2-4.1.0 on the same servers. I made the proper modifications to pgpool.conf according to the pgpool wiki and enabled the watchdog process.

When I start pgpool on all three nodes, I can see that the watchdog is working properly, a quorum exists, and pgpool elects a master (pgpool node), which also brings up the virtual IP from the configuration.

I can connect to the postgres backend via pgpool and issue read and write commands successfully.

The problem appears in the pgpool logs; from syslog I get:

Jan 13 15:10:30 debian10 pgpool[9826]: 2020-01-13 15:10:30: pid 9870: LOG:  failed to connect to PostgreSQL server on "pg1:5433", getsockopt() detected error "Connection refused"

Jan 13 15:10:30 debian10 pgpool[9826]: 2020-01-13 15:10:30: pid 9870: LOCATION:  pool_connection_pool.c:680

The PID mentioned above belongs to the pgpool healthcheck process. pg1, pg2, and pg3 are the database servers, listening on all addresses on port 5433; pg1 is the primary. pgpool listens on 5432.

The database user used for the healthcheck is "pgpool", and I have verified that I can connect to the database with that user from all hosts on the subnet.

When I disable the healthcheck the issue goes away, but that defeats the purpose. Any ideas?
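For reference, the health-check section of my pgpool.conf looks roughly like this. The parameter names are from the pgpool-II 4.1 sample configuration; the values shown here are illustrative, not copied verbatim from my setup:

```ini
# Health-check settings (illustrative values)
health_check_period = 10         # seconds between checks; 0 disables checks
health_check_timeout = 20
health_check_user = 'pgpool'
health_check_password = ''       # set via pool_passwd in the real config
health_check_database = 'postgres'
health_check_max_retries = 3
health_check_retry_delay = 1

# Backends are addressed by hostname, which is what makes
# /etc/hosts resolution matter for the healthcheck
backend_hostname0 = 'pg1'
backend_port0 = 5433
```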


Solution

  • It turned out to be name resolution: the /etc/hosts file combined with listen_addresses in postgresql.conf.

    Specifically, the /etc/hosts was like this:

    vagrant@pg1:~$ cat /etc/hosts
    127.0.0.1 localhost
    127.0.1.1 pg1
    ....
    10.10.20.11 pg1
    ....
    

    And postgresql.conf like this:

    ....
    listen_addresses = 'localhost,10.10.20.11' # what IP address(es) to listen on;
    ....
    

    So when the healthcheck tried to reach the local node on each machine, it connected via hostname (pg1, pg2, etc.). With the hosts file above, the hostname resolves first to 127.0.1.1, an address PostgreSQL doesn't listen on, so the connection fails, hence the error; it then retries with 10.10.20.11, which succeeds. That also explains why healthchecks of remote hosts produced no errors.

    I changed the hosts file to the following:

    vagrant@pg1:~$ cat /etc/hosts
    127.0.0.1 localhost
    127.0.1.1 pg1-local
    ....
    10.10.20.11 pg1
    ....
    

    And the logs are clear.

    This is Debian specific, as Red Hat-based distros don't have a

    127.0.1.1 hostname
    

    record in their /etc/hosts.
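To double-check which address a hostname resolves to first on each node (and therefore which address the healthcheck will try), `getent hosts pg1` is the quickest test. The first-match behaviour can also be demonstrated against a copy of the broken hosts file; this is a sketch, with the file contents taken from the question:

```shell
# Print every address a hosts file maps a name to, in file order
# (the resolver returns the first match, which is what bit us here).
lookup() {  # usage: lookup FILE NAME
  awk -v name="$2" '$0 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == name) print $1 }' "$1"
}

# Reproduce the broken hosts file from the question:
cat > /tmp/hosts.broken <<'EOF'
127.0.0.1 localhost
127.0.1.1 pg1
10.10.20.11 pg1
EOF

lookup /tmp/hosts.broken pg1
# 127.0.1.1    <- tried first; PostgreSQL doesn't listen here -> "Connection refused"
# 10.10.20.11  <- retried next; succeeds
```

With the fixed hosts file (127.0.1.1 pointing at pg1-local instead), the first and only match for pg1 is 10.10.20.11, so the healthcheck connects cleanly on the first attempt.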