Tags: mariadb, ddl, galera

Issues when automatically running DDL queries on Galera cluster


We have a Java web application that runs on WildFly 8.2 with MariaDB 10. We are currently evaluating the migration from a single MariaDB server to a three-node Galera/MariaDB cluster with HAProxy 1.5 as a load balancer.

All works fine when only DML commands are issued against the cluster. But there are situations when our Java app runs DDL commands like CREATE TABLE or ALTER TABLE. For example, in the web interface it is possible to create new customers. Any such customer gets 50 new tables to store its data.

Suppose we add the customer with the id 123, the application will run the following mix of DDL and DML queries:

-- start customer creation
create table Table1_123 (id int not null,.., primary key (id));
create table Table2_123 (id int not null, tid int not null,..., primary key (id));
....
alter table Table2_123 add constraint constraintName foreign key (tid) references Table1_123 (id);
...
insert into Table1_123 ...; -- insert one or more values
insert into Table2_123 ...; -- insert one or more values
...
-- define tables and insert data up to Table50_123

We have wsrep_OSU_method=TOI (the default).
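For reference, the effective value can be checked on each node with the standard Galera status/variable commands (a sketch; these are the stock MariaDB/Galera names):

```sql
-- Check the current schema-upgrade method on a node
SHOW VARIABLES LIKE 'wsrep_OSU_method';
-- Expected value: TOI (Total Order Isolation, the default)
```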

After creating such a customer, I check the newly created tables, and many times I see inconsistencies between the nodes. The tables themselves are properly replicated, but not their content: some of the new tables on some nodes contain data that has not been replicated to the other nodes.

I ran some tests connecting the Java app directly to a single node, bypassing HAProxy. I got the same inconsistencies, but this time only the node the app was directly connected to held un-replicated data; in the previous test the un-replicated data was scattered across all the nodes.

I don't understand this behavior. I don't get any relevant errors in the logs.

Running such DDL queries is something that happens often, and of course we would like to run them automatically rather than manually. What would be the best approach for this situation, where the app runs DDLs as well as DMLs?

Our priorities are the following (in this order):

  1. Maintain consistency during and after DDLs are run.

  2. Do the fewest changes in the application code.

  3. If possible run DDLs automatically and not manually.


Solution

  • Two items to look at:

    1. Have you verified that your cluster still has all nodes with a status of Synced after this happens? It is possible that the node with missing data has been dropped from the cluster.
    2. Are any transactions involved? Galera will not play very nicely with very large transactions. If transactions are involved, use them sparingly and only where necessary.
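The first check can be done with the standard Galera status variables (a sketch; run this on each node individually):

```sql
-- Each node should report Synced and see the full cluster
SHOW STATUS LIKE 'wsrep_local_state_comment';  -- expect: Synced
SHOW STATUS LIKE 'wsrep_cluster_size';         -- expect: 3 on a three-node cluster
SHOW STATUS LIKE 'wsrep_ready';                -- expect: ON
```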

    You definitely want to go with TOI here, but doing a single table at a time instead of trying to do them all in one large block of SQL may be beneficial. Think of each table as a "batch" of sorts.

    If none of this pans out, you could always connect to each node individually and run everything as RSU. But I kinda think that's a workaround more than it is a solution.
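If you do fall back to RSU, the session-level workflow looks roughly like this (a sketch; the `ALTER TABLE` shown is a hypothetical example, and you must connect to every node in turn and run the same DDL on each, since under RSU the statement is not replicated):

```sql
-- On each node, one at a time:
SET SESSION wsrep_OSU_method = 'RSU';
ALTER TABLE Table2_123 ADD COLUMN note VARCHAR(255);  -- hypothetical DDL, repeated on every node
SET SESSION wsrep_OSU_method = 'TOI';
```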

    Reference: GaleraCluster - Schema Upgrades

    EDIT: The answer is in the comments: "It was clearly explained there that XA transactions are not supported by Galera... I switched to non-XA transactions and the problem has gone and the replication works well."
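    For context, the distinction is between XA (two-phase commit) transactions, which Galera does not support, and ordinary transactions, which it replicates normally. In SQL terms the contrast looks like this (a sketch; in a WildFly application this is usually controlled by choosing a non-XA datasource rather than by hand-written SQL):

```sql
-- XA (two-phase commit) transaction: NOT supported by Galera replication
XA START 'txn1';
INSERT INTO Table1_123 (id) VALUES (1);
XA END 'txn1';
XA PREPARE 'txn1';
XA COMMIT 'txn1';

-- Ordinary transaction: replicated normally under TOI
START TRANSACTION;
INSERT INTO Table1_123 (id) VALUES (1);
COMMIT;
```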