Search code examples
postgresqlasp.net-corekubernetesnpgsqlpgbouncer

Npgsql with Pgbouncer on Kubernetes - pooling & keepalives


I'm looking for more detailed guidance / other people's experience of using Npgsql in production with Pgbouncer.

Basically we have the following setup using GKE and Google Cloud SQL:

enter image description here

Right now - I've got npgsql configured as if pgbouncer wasn't in place, using a local connection pool. I've added pgbouncer as a deployment in my GKE cluster as Google SQL has very low max connection limits - and to be able to scale my application horizontally inside of Kubernetes I need to protect against overwhelming it.

My problem is one of reliability when one of the pgbouncer pods dies (due to a node failure or as I'm scaling up/down).

When that happens (1) all of the existing open connections from the client side connection pools in the application pods don't immediately close (2) - and basically result in exceptions to my application as it tries to execute commands. Not ideal!

As I see it (and looking at the advice at https://www.npgsql.org/doc/compatibility.html) I have three options.

  1. Live with it, and handle retries of SQL commands within my application. Possible, but seems like a lot of effort and creates lots of possible bugs if I get it wrong.

  2. Turn on keep alives and let npgsql itself 'fail out' relatively quickly the bad connections when those fail. I'm not even sure if this will work or if it will cause further problems.

  3. Turn off client side connection pooling entirely. This seems to be the official advice, but I am loathe to do this for performance reasons, it seems very wasteful for Npgsql to have to open a connnection to pgbouncer for each session - and runs counter to all of my experience with other RDBMS like SQL Server.

Am I on the right track with one of those options? Or am I missing something?


Solution

  • You are generally on the right track and your analysis seems accurate. Some comments:

    Option 2 (turning out keepalives) will help remove idle connections in Npgsql's pool which have been broken. As you've written your application will still some failures (as some bad idle connections may not be removed in time). There is no particular reason to think this would cause further problems - this should be pretty safe to turn on.

    Option 3 is indeed problematic for perf, as a TCP connection to pgbouncer would have to be established every single time a database connection is needed. It will also not provide a 100% fail-proof mechanism, since pgbouncer may still drop out while a connection is in use.

    At the end of the day, you're asking about resiliency in the face of arbitrary network/server failure, which isn't an easy thing to achieve. The only 100% reliable way to deal with this is in your application, via a dedicated layer which would retry operations when a transient exception occurs. You may want to look at Polly, and note that Npgsql helps our a bit by exposing an IsTransient exception which can be used as a trigger to retry (Entity Framework Core also includes a similar "retry strategy"). If you do go down this path, note that transactions are particularly difficult to handle correctly.