
How transparent should losing access to a Postgres-XL datanode be?


I have set up a testing Postgres-XL cluster with the following architecture:

  • gtm - vm00
  • coord1+datanode1 - vm01
  • coord2+datanode2 - vm02

I created a new database containing a table distributed by replication, meaning every datanode should hold an exact copy of that table.
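For reference, a replicated table in Postgres-XL is declared with the `DISTRIBUTE BY REPLICATION` clause; the table and column names below are purely illustrative:

```sql
-- Hypothetical table, copied in full to every datanode in the cluster
CREATE TABLE accounts (
    id      serial PRIMARY KEY,
    owner   text NOT NULL,
    balance numeric(12,2) DEFAULT 0
) DISTRIBUTE BY REPLICATION;
```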

Doing operations on the table works great, I can see the changes replicated when connecting to all coordinator nodes.

However, when I simulate one of the datanodes going down, while I can still read the data in the table just fine, I cannot add or modify anything, and I receive the following error:

ERROR:  Failed to get pooled connections
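When this error appears, one sanity check is to look at which nodes the coordinator's pooler still expects to reach. Postgres-XL exposes this in the `pgxc_node` catalog (run on a coordinator):

```sql
-- List the datanodes and coordinators this coordinator knows about
SELECT node_name, node_type, node_host, node_port
FROM pgxc_node;
```

If the downed datanode is still listed here, the pooler will keep trying to open connections to it on writes.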

I am considering deploying Postgres-XL as a highly available database backend for a fair number of applications, and I cannot control how those applications interact with the database (it might be a big problem if those applications couldn't write to the database while one datanode is down).

To my understanding, Postgres-XL should achieve high availability for replicated tables in a very transparent way and should be able to tolerate losing one or more datanodes (as long as at least one is still available - again, this is just for replicated tables), but this does not seem to be the case.

Is this the intended behaviour? What can be done in order to be able to withstand having one or more datanodes down?


Solution

  • As it turns out: not transparent at all. To my jaw-dropping surprise, Postgres-XL has no built-in high-availability or recovery support. If you lose one node, the database fails. And if you are using the ROUNDROBIN or HASH options of DISTRIBUTE BY, losing a disk in a single node means losing the entire database. I could not believe it, but that is the case.

    They do have a "standby" server option, which is just a mirrored node for each node you have, but even this requires manually triggering recovery and doubles the number of nodes you need. For data protection you will have to use DISTRIBUTE BY REPLICATION, which is MUCH slower and again has no failover support, so you will have to manually restart the cluster and reconfigure it not to use the failed node.
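    The manual reconfiguration described above can be sketched roughly as follows. This is only an outline of the approach the mailing-list threads describe; the exact commands should be checked against your Postgres-XL version, and `datanode1` / `accounts` are illustrative names, not part of the original setup:

    ```sql
    -- On EACH surviving coordinator, as a superuser:

    -- Stop the replicated table from targeting the failed datanode
    -- (repeat for every replicated table)
    ALTER TABLE accounts DELETE NODE (datanode1);

    -- Remove the failed datanode from this coordinator's node catalog
    DROP NODE datanode1;

    -- Tell the connection pooler to pick up the new node list
    SELECT pgxc_pool_reload();
    ```

    Until all coordinators have been reconfigured this way, writes that touch the failed node will keep failing with the pooled-connection error above.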

    https://sourceforge.net/p/postgres-xl/mailman/message/32776225/

    https://sourceforge.net/p/postgres-xl/mailman/message/35456205/