sql-server · database-design · azure · database-performance · database-partitioning

Table scaling with partitions or with separate databases?


Let's say I have a table (let's call it BigTable) which could experience 5,000,000 INSERTs per day (with possibly just as many SELECTs). Each row inserted is about 50 KB.

These daily INSERTs are split equally across 5 clients (the table has an FK called ClientID). There is never a need to SELECT or JOIN data across multiple clients.

I am worried about the database performance as this table grows, so I have come up with two solutions.

SOLUTION 1:

  • Partition BigTable by ClientID
  • Store each partition on a separate hard disk on the server (using Azure blob storage).
  • Move all data which is more than 1 month old (archive data, which still needs to be queryable) into another set of READONLY partitions.

Essentially this means the following partitions on their own storage devices:

  • Primary (all data excluding BigTable)
  • ClientA's BigTable (5,000,000 rows per day / 5 clients x 30 days = 30,000,000 rows)
  • ClientB's BigTable (30,000,000 rows)
  • ClientC's BigTable (30,000,000 rows)
  • ClientD's BigTable (30,000,000 rows)
  • ClientE's BigTable (30,000,000 rows)
  • ClientA's BigTable archive
  • ClientB's BigTable archive
  • ClientC's BigTable archive
  • ClientD's BigTable archive
  • ClientE's BigTable archive
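
As a rough T-SQL sketch, this is how I imagine declaring that layout (the filegroup names, the mapping of ClientID values 1-5, and the column definitions are illustrative assumptions, not a final design):

    -- One filegroup per data disk; names are placeholders, and each would
    -- get a data file placed on its own attached disk.
    ALTER DATABASE BigDb ADD FILEGROUP FG_ClientA;
    ALTER DATABASE BigDb ADD FILEGROUP FG_ClientB;
    ALTER DATABASE BigDb ADD FILEGROUP FG_ClientC;
    ALTER DATABASE BigDb ADD FILEGROUP FG_ClientD;
    ALTER DATABASE BigDb ADD FILEGROUP FG_ClientE;

    -- Five partitions, one per ClientID.
    CREATE PARTITION FUNCTION pfClient (int)
        AS RANGE LEFT FOR VALUES (1, 2, 3, 4);  -- partitions: <=1, 2, 3, 4, >=5

    CREATE PARTITION SCHEME psClient
        AS PARTITION pfClient
        TO (FG_ClientA, FG_ClientB, FG_ClientC, FG_ClientD, FG_ClientE);

    -- The partitioning column must be part of the clustered key.
    CREATE TABLE dbo.BigTable (
        BigTableID  bigint        IDENTITY(1, 1) NOT NULL,
        ClientID    int           NOT NULL,
        InsertedAt  datetime2     NOT NULL DEFAULT SYSUTCDATETIME(),
        Payload     nvarchar(max) NULL,  -- stand-in for the ~50 KB of row data
        CONSTRAINT PK_BigTable PRIMARY KEY CLUSTERED (ClientID, BigTableID)
    ) ON psClient (ClientID);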

Each client's archive table will hold (1,000,000 rows per day) x (age of DB in days) - 30,000,000 rows still held live. These are still huge tables, but they will only be used for drawing up the odd report.
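
For the monthly rotation into the READONLY partitions, I am picturing something along these lines (a sketch only: dbo.BigTableArchive is assumed to already exist with matching columns, on its own archive filegroup):

    -- Move rows older than 30 days into the archive table in small batches,
    -- so each transaction stays short.
    -- (OUTPUT ... INTO requires the target to have no triggers or FKs.)
    WHILE 1 = 1
    BEGIN
        DELETE TOP (5000)
        FROM dbo.BigTable
        OUTPUT deleted.* INTO dbo.BigTableArchive
        WHERE InsertedAt < DATEADD(DAY, -30, SYSUTCDATETIME());

        IF @@ROWCOUNT = 0 BREAK;
    END;

    -- Once a filegroup holds only closed-off archive data, it can be marked
    -- read-only (FG_ArchiveA is a placeholder name; this step needs
    -- exclusive access to the database).
    ALTER DATABASE BigDb MODIFY FILEGROUP FG_ArchiveA READ_ONLY;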

SQL Server will be hosted on a 14 GB, 8-core Azure VM.

SOLUTION 2:

The other option is to host a separate database for each client. This means each client will have its own dedicated SQL Server machine. Partitioning will still happen for archive data.

This option is not optimal because of the physical separation of the data. Having to manage updates across multiple databases could be very problematic, and maintaining separate database connections for each client is an extra consideration for the developers.

Could anyone perhaps advise on these options?


Solution

  • Since you have tagged this with [azure] and [sql-server], I assume that you are trying to do this in Windows Azure. If that is the case then a) partitioning by client is not necessarily a good idea, and b) SQL may not be the best (complete) fit for your problem.

    When building scalable architectures, the partitioning strategy shouldn't be based on something specific like 'client', but rather something more arbitrary. The reason is simple — unless clients have a reason to be separate, such as not wanting their data mixed with others, or different SLAs per client, then the choice of 'client' as a partition may not render the optimal result. If 80% of your business is generated by a single client, you have not solved your problem, and still have to maintain n separate databases for marginal load.

    5 mil database inserts per day is not a big number, but it may be a big number for SQL Server hosted in Azure IaaS, or for Azure SQL Database, because of the performance of the underlying commodity hardware. Before determining how to partition SQL, ask yourself two questions. First, what are the usage and performance characteristics that you want from the data? (Does it have to be immediately consistent? Can you process data asynchronously?) Second, have you mapped those characteristics against other data store technologies? Have you considered Table Storage (or non-MS solutions such as Redis)?

    You may find, after trying out a few options, that:

    • SQL is a good store for some of the data, some of the time.
    • Much of the processing can be done asynchronously, so the peak performance of inserts is of little concern (and doing 5 mil inserts over a 24-hour period is not a problem).
    • SQL may not be suited to long-term storage.
    • Querying of older data can be done effectively using map-reduce, rather than SQL queries.

    For example, I have an app that tracks vehicles at one-second intervals. It is targeted at 100,000 vehicles but architected in such a way as to be able to scale up to millions without changing any code or databases. In the medium term, though, it has to cope with 72 mil inserts per day. All of this runs on a single Windows Azure SQL database that is less than 10GB, plus a whole bunch of table storage.

    The reason why this works is that although I want to archive all data (72 mil rows per day), I don't need complex SQL query access to it, so it sits happily in table storage. What I store in SQL is a summary of the data. In my case, I am only interested in a vehicle's journey (start and end position, distance travelled, etc.), which means I only need two or three rows per vehicle per day in SQL, greatly reducing the load on the database.

    Also, my bottleneck is in the collection of data, so I add the data immediately to a (Windows Azure) queue and worry about summarising it in a separate workload.
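
    To illustrate the shape of that summary table (the names and columns here are illustrative, not my actual schema):

        -- One row per completed journey, instead of one row per second.
        CREATE TABLE dbo.VehicleJourney (
            VehicleID     int           NOT NULL,
            JourneyStart  datetime2     NOT NULL,
            JourneyEnd    datetime2     NOT NULL,
            StartPosition geography     NULL,
            EndPosition   geography     NULL,
            DistanceKm    decimal(9, 2) NULL,
            CONSTRAINT PK_VehicleJourney PRIMARY KEY (VehicleID, JourneyStart)
        );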

    This answer may be a bit long, but it is intended to get you to think about your data model more carefully, rather than just trying to solve the problem with SQL. For more detail, have a look at the data model in CALM.