Tags: spring, dynamic, eclipselink, partitioning, multi-tenant

Using partitioning to do multi-schema multi-tenancy with dynamic tenants


I'm writing a web application that must be multi-tenant. I'm using JPA for the persistence layer and I'm evaluating EclipseLink with interest.

The multi-tenant strategy I want to use is: one schema per customer. Hibernate supports such a strategy (http://docs.jboss.org/hibernate/orm/4.2/devguide/en-US/html/ch16.html#d5e4771) and I've already used it with success. However, AFAIK it supports it only when using the native Hibernate API, while I want to use JPA.

EclipseLink, on the other hand, supports single-table and multi-table multi-tenancy strategies. However, it also supports partitioning and with a simple custom partitioning policy I may easily set up one partition for each customer.

The first question might be whether using partitioning for this use case is appropriate or not.

The main problem, however, is that the customer base may (hopefully) grow over time, so I have to make EclipseLink "know" about new customers dynamically (i.e. without restarting the webapp). From what I understand, to set up partitioning in EclipseLink I have to set up my persistence unit with different "connection pools" (or "nodes"): each node has its own configured data source and a name, and the partitioning strategy determines the node to use by its name. So far so good, but I plan to set up my persistence unit using Spring's LocalContainerEntityManagerFactoryBean. I may discover customers dynamically on startup, when the LocalContainerEntityManagerFactoryBean is processed, so that I can pass all the needed properties for all the nodes/customers at that time, but what happens if a new customer is added afterwards? I don't think that changing persistence unit properties dynamically will have any effect on the already constructed EntityManagerFactory singleton instance... and I fear EclipseLink will complain if I request a partition for which no corresponding node was known at EntityManagerFactory creation time. Correct me if I'm wrong.

I think that declaring the LocalContainerEntityManagerFactoryBean as a "prototype"-scoped bean would be a very bad idea, and I doubt it would work at all. On the other hand, since a customer interaction is bound to a specific HTTP session, I could take a "middle" approach and declare the LocalContainerEntityManagerFactoryBean with "session" scope, but then I would have to deal with problems like increased memory consumption and shared cache coordination between multiple EntityManagerFactories (one for each customer using the application at a given time).

If I can't make this strategy work, I think I'll have to abandon partitioning as a whole and fall back to a "dynamic data source routing" approach, but in this case I'm concerned about the EclipseLink shared cache consistency (I think I'll have to disable it completely and this would be a real disadvantage).

Thanks in advance for any feedback on this.


Solution

  • Honestly, I didn't try Chris's suggestion, but opted for a more fine-tuned approach. This is my solution.

    • in my case, tenant = customer; each customer's data is in its own database schema, potentially located in a dedicated DBMS instance (of whatever vendor); in other words, I have a different data source per customer
    • since I use partitioning, this means that every customer has its own partition; each partition is identified by the corresponding unique customer id
    • every user that logs into the application belongs to exactly one customer; I use Spring Security to handle authentication and authorization, hence I can retrieve information about the user (including their owning customer) by querying the SecurityContextHolder
    • I defined my own EclipseLink PartitioningPolicy which determines the customer of the currently logged in user as described in the previous point, and then returns a list containing a single Accessor that identifies that customer's partition (a sketch of such a policy follows the registration snippet below)
    • all my tables must be partitioned and I don't want to specify that on EVERY entity with annotations, so I registered this partitioning policy with EclipseLink at startup and set it as the default one; briefly:

      // Unwrap the EclipseLink ServerSession backing the JPA EntityManagerFactory
      JpaEntityManagerFactory jpaEmf = entityManagerFactory.unwrap(JpaEntityManagerFactory.class);
      ServerSession serverSession = jpaEmf.getServerSession();
      // Register the policy and make it the default for all entities
      serverSession.getProject().addPartitioningPolicy(myCustomerPolicy);
      serverSession.setPartitioningPolicy(myCustomerPolicy);
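
    For reference, here is a minimal sketch of what such a policy might look like. The class name is illustrative, and MyPrincipal with its getCustomerId() method is a placeholder for the application's own user model, not EclipseLink or Spring API:

      import java.util.ArrayList;
      import java.util.List;

      import org.eclipse.persistence.descriptors.partitioning.PartitioningPolicy;
      import org.eclipse.persistence.internal.databaseaccess.Accessor;
      import org.eclipse.persistence.internal.sessions.AbstractRecord;
      import org.eclipse.persistence.internal.sessions.AbstractSession;
      import org.eclipse.persistence.queries.DatabaseQuery;
      import org.springframework.security.core.context.SecurityContextHolder;

      public class CustomerPartitioningPolicy extends PartitioningPolicy {

        public CustomerPartitioningPolicy() {
          // The name under which the policy is registered in the EclipseLink project
          setName("customer-partitioning-policy");
        }

        @Override
        public List<Accessor> getConnectionsForQuery(AbstractSession session, DatabaseQuery query, AbstractRecord arguments) {
          // Route the query to the connection pool named after the current customer id
          List<Accessor> accessors = new ArrayList<>(1);
          accessors.add(getAccessor(currentCustomerId(), session, query, false));
          return accessors;
        }

        // Resolve the tenant from the authenticated principal; MyPrincipal and
        // getCustomerId() stand in for the application's own user model
        private String currentCustomerId() {
          MyPrincipal principal = (MyPrincipal) SecurityContextHolder.getContext()
              .getAuthentication().getPrincipal();
          return principal.getCustomerId();
        }
      }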
      

    Then, to dynamically add data sources to EclipseLink ("connection pools" in its terminology), so that the customer id returned by the policy above matches a known connection pool, I do the following:

    • a listener intercepts every successful user login
    • this listener asks EclipseLink whether it already knows about a connection pool identified by the user's customer id; if it does, we're done, EclipseLink can correctly handle the partition; otherwise a new connection pool is created and added to EclipseLink; proof of concept (a wiring sketch for this listener is shown a bit further below):

      // Customer id of the user who has just logged in
      String customerId = principal.getCustomerId();
      JpaEntityManagerFactory jpaEmf = entityManagerFactory.unwrap(JpaEntityManagerFactory.class);
      ServerSession serverSession = jpaEmf.getServerSession();
      if (!serverSession.getConnectionPools().containsKey(customerId)) {
        // First login for this customer: build its dedicated data source...
        DataSource customerDataSource = createDataSourceForCustomer(customerId);
        DatabaseLogin login = new DatabaseLogin();
        login.useDataSource(customerId);
        login.setConnector(new JNDIConnector(customerDataSource));
        // ...configure the database platform for this customer's DBMS vendor...
        Class<? extends DatabasePlatform> databasePlatformClass = determineDbVendorPlatform(customerId);
        login.usePlatform(databasePlatformClass.newInstance());
        // ...and register a connection pool named after the customer id, so that
        // the partitioning policy can route queries to it
        ConnectionPool connectionPool = new ExternalConnectionPool(customerId, login, serverSession);
        connectionPool.startUp();
        serverSession.addConnectionPool(connectionPool);
      }
      

    The user login operation is of course performed against a central database (or any other source of authentication), so the above code runs before any customer-specific JPA query is executed (and hence the customer connection pool is added to EclipseLink before the partitioning policy ever references it).
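
    For illustration, the listener could be wired as a Spring ApplicationListener for Spring Security's AuthenticationSuccessEvent. This is only a sketch: it assumes an AuthenticationEventPublisher is configured so that the event is actually published, and it reuses the placeholder MyPrincipal type and the proof-of-concept code above:

      import org.springframework.context.ApplicationListener;
      import org.springframework.security.authentication.event.AuthenticationSuccessEvent;
      import org.springframework.stereotype.Component;

      @Component
      public class TenantConnectionPoolRegistrar implements ApplicationListener<AuthenticationSuccessEvent> {

        @Override
        public void onApplicationEvent(AuthenticationSuccessEvent event) {
          // The placeholder principal type carrying the customer id
          MyPrincipal principal = (MyPrincipal) event.getAuthentication().getPrincipal();
          registerConnectionPoolIfAbsent(principal.getCustomerId());
        }

        private void registerConnectionPoolIfAbsent(String customerId) {
          // ... body as in the proof of concept above: check getConnectionPools(),
          // create and start an ExternalConnectionPool if the customer is unknown ...
        }
      }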

    There's an important aspect to take into consideration, though. In EclipseLink, data partitioning means that an identifiable piece of data (= an entity instance) is either in just one partition, or equally replicated in multiple partitions. Entity instance identity is determined through the identifier (= primary key). This means that there should not exist two different entity instances of type E with the same id=x for two different customers/tenants T1 and T2, otherwise EclipseLink might think they are the exact same entity instance. This may lead to mixed data from different customers being read/written during a single JPA session => a disaster. Possible solutions:

    1. in this scenario, the partition to use is determined by the currently logged in user; this means it will be the same for every query executed within the scope of the HTTP session; since I use transaction-scoped entity managers, whose lifespan is at most the duration of a request (which is itself contained within the HTTP session), simply disabling the EclipseLink shared cache removes the possibility of data from different customers being mixed (see the property sketch after this list); however, losing the shared cache is still undesirable
    2. the best option I could find is to make sure that all ids (= primary keys) are generated, and that the generation is handled by EclipseLink in a central, cross-customer way, so that id=x for entity E is certainly assigned to just one entity instance of one customer only; this effectively means "partitioning" the id assignment sequences over customers and rules out MySQL auto-increment columns (i.e. the database identity generation type); so I opted for the table generation type for entity identifiers and put that table in the central database where user and customer information is stored
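
    For reference, option 1 boils down to disabling the EclipseLink shared cache via the standard eclipselink.cache.shared.default property. A sketch of how it could be passed through Spring's LocalContainerEntityManagerFactoryBean (the variable names are illustrative):

      // Inside the Spring configuration that builds the EntityManagerFactory
      LocalContainerEntityManagerFactoryBean emfBean = new LocalContainerEntityManagerFactoryBean();
      Map<String, Object> jpaProperties = new HashMap<>();
      // Disable the EclipseLink shared (second-level) cache for all entities
      jpaProperties.put("eclipselink.cache.shared.default", "false");
      emfBean.setJpaPropertyMap(jpaProperties);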

    The last little problem to solve to implement option 2 correctly is that, even though the EclipseLink documentation says it's possible to specify a connection pool (= data source) dedicated to table sequencing using the eclipselink.connection-pool.sequence configuration option, this seems to be ignored when a default partitioning policy is set as described above. In fact, my customer partitioning policy gets invoked for EVERY query, even those used for id allocation. For this reason, the policy must intercept those queries and route them to the central data source. I couldn't find a definitive solution to this problem, but the best options I could think of are:

    1. if the SQL string of the query starts with "UPDATE SEQUENCE ", it means that it's a query for id allocation, with the assumption that the table dedicated to sequence allocation is called SEQUENCE (this is the default)
    2. if you adopt the convention of adding a SEQUENCE suffix to your generator names, then an executed query whose name ends with "SEQUENCE" is a query for id allocation

    I opted for option 2 and defined my id generation mappings accordingly:

      @Entity
      public class MyEntity {
        @Id
        // Table-based generation: the generator's SEQUENCE table lives in the central database
        @TableGenerator(name = "MyEntity_SEQUENCE", allocationSize = 10)
        @GeneratedValue(strategy = GenerationType.TABLE, generator = "MyEntity_SEQUENCE")
        private Long id;
      }
    

    This makes EclipseLink use a table named SEQUENCE, containing one row whose SEQ_NAME column value is MyEntity_SEQUENCE. The query used to update this sequence for id allocation will be named MyEntity_SEQUENCE and we're done. However, I made my partitioning policy configurable so that I can switch from one sequence-query-identification strategy to the other at any time, in case something changes in the EclipseLink implementation that breaks this heuristic.
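
    Putting it together, the sequence-aware routing inside the policy might look roughly like the following. It extends the policy sketch shown earlier (reusing its currentCustomerId() helper), and the "central" pool name for the database hosting the SEQUENCE table is an assumption:

      @Override
      public List<Accessor> getConnectionsForQuery(AbstractSession session, DatabaseQuery query, AbstractRecord arguments) {
        List<Accessor> accessors = new ArrayList<>(1);
        String queryName = query.getName();
        if (queryName != null && queryName.endsWith("SEQUENCE")) {
          // Id-allocation query (option 2's naming convention): route it to the
          // connection pool of the central database hosting the SEQUENCE table
          accessors.add(getAccessor("central", session, query, false));
        } else {
          // Regular query: route it to the current customer's partition
          accessors.add(getAccessor(currentCustomerId(), session, query, false));
        }
        return accessors;
      }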

    That is substantially the whole picture. So far it has been working well. Feedback, improvements and suggestions are welcome.