In sharding, it is advisable to follow a "share nothing, shard everything" approach. Essentially, this translates to: data that is used together should be stored together (i.e. in a single shard). This fits in nicely with the "split your data at the application level" philosophy.
But it is not always practical to completely denormalize the data. Consider, for example, an eCommerce site. All data corresponding to a user should ideally be kept together. However, some data, such as product availability, needs to be synchronized across shards, and in cases like this one, the sync needs to happen in real time. As an example, suppose a user has added products to their wishlist. In principle, all data about the wishlist items should be kept together with the rest of the user's data. However, multiple users can wishlist the same item, and other users can also buy the item, thus affecting its availability. So that information has to be shared and synced in some way.
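To make this concrete, here is a hypothetical sketch (the table and column names are mine, not from any real schema) of how a wishlist could live entirely in the user's shard while holding only a reference to the product, whose availability is maintained elsewhere:

```sql
-- Hypothetical schema: a user's wishlist rows are co-located in that user's
-- shard (user_id is the shard key), so they can be read without crossing shards.
CREATE TABLE wishlist_items (
    user_id      bigint      NOT NULL,  -- shard key: keeps all of a user's data together
    product_id   bigint      NOT NULL,  -- reference only; availability is NOT copied here
    product_name text,                  -- denormalized for display; safe because it rarely changes
    added_at     timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (user_id, product_id)
);

-- Availability (stock) lives in a separate, shared table and must be looked up
-- or pushed to the application at read time rather than stored per wishlist.
```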
The complexity and performance hit of the resulting cross-shard joins are also well known.
To make the issue even thornier, in the above example, or in any case requiring proper ACID transactions, the sync has to be practically real time. It might be acceptable for a count of likes to be updated with a delay, but it would be bad business for an eCommerce site to show stale product availability!
What's the best approach in cases like this?
One preliminary line of thought was to use a push/notification mechanism for fields such as availability, transaction status, and so on. But I am really keen to hear from the experts. Obviously, I've read about eBay's architecture on High Scalability, but it doesn't go into this level of implementation detail.
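For what it's worth, here is a minimal sketch of what such a push mechanism could look like in stock Postgres, using LISTEN/NOTIFY. The `product_stock` table and the channel name are assumptions for illustration, not anything from the eBay writeup:

```sql
-- Broadcast availability changes to any listening application server or
-- shard-sync worker via Postgres's built-in NOTIFY mechanism.
CREATE OR REPLACE FUNCTION notify_stock_change() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify(
        'stock_changes',
        json_build_object('product_id', NEW.product_id, 'qty', NEW.qty)::text
    );
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER stock_change_notify
    AFTER UPDATE OF qty ON product_stock
    FOR EACH ROW
    WHEN (OLD.qty IS DISTINCT FROM NEW.qty)
    EXECUTE PROCEDURE notify_stock_change();
```

Listeners would run `LISTEN stock_changes;` and react as notifications arrive. Note that NOTIFY is fire-and-forget: a listener that is disconnected misses messages, so this suits cache invalidation better than serving as the sole source of truth.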
There is a pertinent discussion on dba.stackexchange, but it has more questions than answers.
EDIT/UPDATE: The database in question is Postgres.
Since the database in question is PostgreSQL, I am going to talk about how a few of the really high-end lattice frameworks address this on Postgres. If nothing else, this gives you a starting point.
For the purposes of this discussion, a lattice framework is one where you essentially have a series of coordinated shards that appear as a single database. Arguably there are some differences, particularly in tooling, but such frameworks hit exactly the problems you are asking about. Lattices manage storage nodes through mechanisms like two-phase commit, allowing coordinators to enforce cross-shard referential integrity. The most commonly used framework in the Pg world is Postgres-XL.
With Postgres-XL (formerly called StormDB), you have a division between your coordinator and storage nodes. Each storage node is effectively a shard. The coordinator nodes manage access to the storage nodes and present the entire set of storage nodes as a single coherent database.
A couple points:
Your choice in this case is either to accept the cross-node performance hit or to synchronize (replicate) the data across nodes. If you do the latter, a write is never faster than your slowest node, so you don't want to do this with frequently written data.
What this means, effectively, is that you have to identify which data is frequently written and which is mostly stable. The stable data can be synchronized; the volatile data should not be.
So in the eCommerce case, if your parts catalog is fairly stable, you would synchronize (replicate) it and then shard customers and orders. Live on-hand stock of parts, however, would not be replicated between the hosts but separately sharded/partitioned, so that writes to it also scale out across nodes. Sometimes this means you have to think about breaking existing tables up into stable and volatile portions.
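As a rough sketch of that split in Postgres-XL terms (the table definitions are illustrative assumptions, not a real schema), the stable catalog is replicated to every node while the volatile and customer-centric tables are hash-distributed:

```sql
-- Stable catalog data: replicated to every node, so joins against it
-- resolve locally and never cross shards.
CREATE TABLE parts (
    part_id    bigint PRIMARY KEY,
    name       text NOT NULL,
    list_price numeric(10,2)
) DISTRIBUTE BY REPLICATION;

-- Volatile on-hand stock: hash-sharded so writes scale out instead of
-- having to hit every node.
CREATE TABLE parts_on_hand (
    part_id      bigint NOT NULL,
    warehouse_id int    NOT NULL,
    qty          int    NOT NULL,
    PRIMARY KEY (part_id, warehouse_id)
) DISTRIBUTE BY HASH (part_id);

-- Customer-centric data: sharded by customer, keeping each customer's
-- rows together on one node.
CREATE TABLE orders (
    order_id    bigint,
    customer_id bigint NOT NULL,
    placed_at   timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (customer_id, order_id)
) DISTRIBUTE BY HASH (customer_id);
```

With a layout like this, a query joining orders to the parts catalog resolves the catalog side locally on each node, while an update to `parts_on_hand` touches only the node that owns that hash bucket.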