Search code examples
sql-serverindexingdatabase-designcqrsevent-sourcing

Event Sourcing/CQRS Clustered Index and Partitioning


I work at Tax processing company using SQL Server 2016. We process millions of tax information returns, and setup concurrent, multi threading, parallel processing system.

  1. For a parallel processing write event store, What should clustered index be on? on UniqueIdentifier Guid , or (Clustered index on Identity(1,1) with Ncx on Uniqueidentifier guid)?, or no clustered index (utilize heap)?

  2. Do you generally recommend partitioning write event store table?

  3. When we update our read-model for querying, should we still utilize parallel processing to update the read model? Or should we conduct single stream update?

  4. Also, again what should clustered index be on the read-event model, UniqueIdentifierGuid or (Clustered index on Identity(1,1) with Ncx on Uniqueidentifier guid)?

  5. should we partition the read-model table or any other methods?

There is a general rule that indexes on uniqueidentifierguids are bad clustered index, cause massive page fragmentation, slower io writes, and large disk space. https://blogs.msdn.microsoft.com/sqlserverfaq/2010/05/27/guid-vs-int-debate/

However, indexes on identity(1,1) integer columns cause latch contention, last page insert “hot spots” in parallel processing. http://www.sqlpassion.at/archive/2014/04/15/an-ever-increasing-clustered-key-value-doesnt-scale/


Solution

  • Your individual needs will depend on the architecture of your unique system. Generally, you'll need to be able to test, measure and profile to determine what your bottlenecks are (or will be).

    For example, the number of nodes you have writing in parallel at any one time will and the throughput you need at any instant.

    Here are two tips to get you started:

    • You probably want a GUID id (indexed) and a second column for the clustered index. I use an Identity (sequential number) column as my Clusered Index because it's generated in the database. Generally, you won't be physically writing to disk in parallel (even if you try to do it in parallel) so just make it fast and simple (and profile it!).

    • For each "Read Model" you generate you'll generally need to process events in serial. You can have multiple "Read Models" and if the data isolated you can build them in parallel.

    I'm not sure how familiar with EventSourcing you are but I can't recommend these two resources enough.

    http://docs.geteventstore.com/introduction/4.0.2/event-sourcing-basics/ https://leanpub.com/esversioning