database postgresql kubernetes architecture cloud

Is running a database in Kubernetes an antipattern?

Let's say we are running some services in a Kubernetes cluster and one of them requires a PostgreSQL instance, expected to persist data reliably. Should the DB live in the cluster or be configured separately?

Imagine that the DB is deployed in the cluster. This probably means one of the following:

1. We need a process for migrating the data to another node in case the current one goes down. This sounds like a non-trivial task. Or:
2. The node where the DB lives has to be treated in a special way. Horizontal scaling must be constrained to the other nodes and the cluster ceases to be homogeneous. This might be seen as a design flaw, going against the spirit of maintaining disposable, replaceable containers.

Point (1) applies only to self-managed clusters where all the storage at our disposal is tied to the machines the nodes run on. If we are using a managed cloud, we can use persistent volume claims, and a new instance can pick up the data automatically. Still, this means that if the node with the DB is removed, we will suffer database downtime until a new instance comes up. So point (2) also remains valid for managed K8s offerings.

Therefore I can well understand the argument for keeping the DB outside of Kubernetes. What would some counterarguments look like? There are a lot of official Helm charts for various DBs, which suggests that people keep their DBs in Kubernetes clusters after all.

Happy to learn some critical thoughts!

Solution

This is not an anti-pattern. It is just difficult to implement and manage.

Point 1

In a self-hosted cluster you can also have persistent volume storage provisioned through GlusterFS or Ceph, so you don't always have to use ephemeral storage. Point 1 is therefore not fully valid.
DBs are generally created as StatefulSets, where every instance gets its own copy of the data.
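
To make the "every instance gets its own copy of data" part concrete, here is a minimal sketch of a StatefulSet manifest built as a plain Python dict and printed as JSON (which `kubectl apply -f` also accepts). The names, the `postgres:16` image tag, the mount path and the `10Gi` size are my own assumptions for illustration, and the manifest is deliberately incomplete: it has no credentials and no replication setup.

```python
import json


def postgres_statefulset(name="postgres", replicas=3, storage="10Gi"):
    """Build a minimal StatefulSet manifest as a dict.

    All names and sizes are hypothetical placeholders, not a production setup.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            # Name of the governing (headless) Service for the pods.
            "serviceName": name,
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "postgres",
                            "image": "postgres:16",  # assumed image tag
                            "ports": [{"containerPort": 5432}],
                            "volumeMounts": [
                                {
                                    "name": "data",
                                    "mountPath": "/var/lib/postgresql/data",
                                }
                            ],
                        }
                    ]
                },
            },
            # One PersistentVolumeClaim is created per pod from this template,
            # so each replica keeps its own data across rescheduling.
            "volumeClaimTemplates": [
                {
                    "metadata": {"name": "data"},
                    "spec": {
                        "accessModes": ["ReadWriteOnce"],
                        "resources": {"requests": {"storage": storage}},
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(postgres_statefulset(), indent=2))
```

The important bit is `volumeClaimTemplates`: the claim is bound to the pod's identity rather than to a node, so a rescheduled pod can reattach to the same volume if the storage backend allows it.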

Point 2

When your DB cluster scales horizontally, the 'init' container of the new DB pod, or a CRD provided by the DB, needs to register the 'secondary' DB pod so that it becomes part of your DB cluster.
A StatefulSet also needs to run behind a headless service, so that the address of each endpoint is known at all times for cluster health checks, primary->secondary data sync, and electing a new primary in case the primary node goes down.
So, as long as the new pods register themselves with the DB cluster, you will be okay to run your DB workload inside a Kubernetes cluster.
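
As a rough illustration of why the headless service matters: each StatefulSet pod gets a stable DNS name of the form `<statefulset>-<ordinal>.<service>.<namespace>.svc.<cluster-domain>`, and a headless service is simply one with `clusterIP` set to `None`. The sketch below only builds those names and the service manifest; the service name `postgres-headless`, the namespace and the replica count are assumptions, and `cluster.local` is the usual default cluster domain, which your cluster may override.

```python
def headless_service(name="postgres-headless", app_label="postgres", port=5432):
    """Minimal headless Service manifest as a dict (hypothetical names)."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "clusterIP": "None",  # this is what makes the Service headless
            "selector": {"app": app_label},
            "ports": [{"port": port}],
        },
    }


def pod_dns_names(statefulset, service, namespace="default", replicas=3,
                  cluster_domain="cluster.local"):
    """Stable per-pod DNS names that peers can use for health checks and sync."""
    return [
        f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


if __name__ == "__main__":
    for host in pod_dns_names("postgres", "postgres-headless"):
        print(host)
```

An init container or operator can use names like these to find the current primary and register a new secondary, since they stay the same even when a pod is rescheduled and its IP changes.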

Further reading: https://devopscube.com/deploy-postgresql-statefulset/