python, postgresql, apache-spark, pyspark, spark-jdbc

Why does PostgreSQL say FATAL: sorry, too many clients already when I am nowhere close to the maximum connections?


I am working with an installation of PostgreSQL 11.2 that periodically complains in its system logs

FATAL:  sorry, too many clients already

despite being nowhere close to its configured limit of connections. This query:

SELECT current_setting('max_connections') AS max,
       COUNT(*) AS total
FROM pg_stat_activity

tells me that the database is configured for a maximum of 100 connections. I have never seen more than about 45 connections with this query, not even moments before a running program receives a database error about too many clients, backed by the above message in the Postgres logs.

Absolutely everything I can find about this issue on the Internet suggests that the error means you have exceeded the max_connections setting, but the database itself tells me that I have not.

For what it's worth, pyspark is the only database client that triggers this error, and only when it is writing into tables from dataframes. The regular Python code using psycopg2 (which is the main client) never triggers it, not even when writing into tables in the same manner from Pandas dataframes, and admin tools like pgAdmin never trigger it either. If I didn't see the error in the database logs directly, I would think that Spark was lying to me about the error. Most of the time, if I use a query like this:

SELECT pg_terminate_backend(pid) FROM pg_stat_activity 
WHERE pid <> pg_backend_pid() AND application_name LIKE 'pgAdmin%';

then the problem goes away for several days. But like I said, I've never seen even 50% of the supposed max of 100 connections in use, according to the database itself. How do I figure out what is causing this error?
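
For reference, a grouped snapshot along these lines shows which clients hold connections at a given moment (it uses the same psycopg2 driver as the main client; the connection string is a placeholder):

import psycopg2

# Group the current backends by client and state to see who is holding connections.
conn = psycopg2.connect("dbname=mydb user=me host=dbhost")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT application_name, state, COUNT(*) AS connections
        FROM pg_stat_activity
        GROUP BY application_name, state
        ORDER BY connections DESC
    """)
    for application_name, state, connections in cur.fetchall():
        print(application_name, state, connections)
conn.close()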


Solution

  • This is caused by how Spark reads/writes data using JDBC. Spark tries to open several concurrent connections to the database in order to read/write multiple partitions of data in parallel.

    I couldn't find it in the docs, but I think that by default the number of connections is equal to the number of partitions in the dataframe you want to write into the database table, which would explain the intermittency you've noticed (a write-side sketch follows the read example below).

    However, you can control this number by setting the numPartitions option, which the documentation describes as follows:

    The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing.

    Example (the connection details below are placeholders):

    df = spark.read.format("jdbc") \
        .option("url", "jdbc:postgresql://dbhost:5432/mydb") \
        .option("dbtable", "source_table") \
        .option("numPartitions", "20") \
        .load()
    # ... other options (user, password, etc.) as needed
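
    The failures described in the question happen on writes, and the same cap applies on the write path. Below is a minimal sketch, assuming df is the dataframe being written to Postgres; the URL, table name, and credentials are placeholders:

    # How many partitions (and hence, by default, concurrent JDBC connections)
    # an uncapped write would use:
    print(df.rdd.getNumPartitions())

    # Cap the write at 8 concurrent JDBC connections; per the documentation quoted
    # above, Spark coalesces the dataframe down to numPartitions before writing if
    # it currently has more partitions than that.
    df.write.format("jdbc") \
        .option("url", "jdbc:postgresql://dbhost:5432/mydb") \
        .option("dbtable", "target_table") \
        .option("user", "spark_user") \
        .option("password", "...") \
        .option("numPartitions", "8") \
        .mode("append") \
        .save()

    Calling df.coalesce(8) yourself before the write has the same effect of limiting the number of partitions, and therefore the number of simultaneous connections Postgres sees.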