primary-key azure-storage md5 azure-table-storage

Should I split a Primary key into Partition Key and Row Key components?

I want to store data in an Azure Table. The Primary Key for this data will be an MD5 hash.

To get a good balance of performance and scalability it is a good idea to use a combination of both Partition Key and Row Key in the Azure Table.

I am considering splitting the MD5 hash into two parts at an arbitrary point. I will probably use the first three or so characters for the Partition Key so that many hashes share the same prefix, and I therefore end up with Partitions that each hold a decent number of rows. The rest of the characters will make up the Row Key. With a three-character hexadecimal prefix, the data would be spread over 4,096 (16^3) Partitions.
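To make the split concrete, here is a minimal sketch of what I have in mind, assuming the hash is stored as its usual 32-character lowercase hex string. The function name `split_md5_key` and the `prefix_len` parameter are just names I made up for illustration.

```python
import hashlib


def split_md5_key(data: bytes, prefix_len: int = 3) -> tuple[str, str]:
    """Split the hex MD5 of `data` into (partition_key, row_key).

    Hypothetical helper: the first `prefix_len` hex characters become the
    Partition Key and the remaining characters become the Row Key.
    """
    digest = hashlib.md5(data).hexdigest()  # 32 lowercase hex characters
    return digest[:prefix_len], digest[prefix_len:]


# Each hex character has 16 possible values, so a 3-character prefix
# gives 16 ** 3 possible partitions.
partition_count = 16 ** 3

partition_key, row_key = split_md5_key(b"hello")
```

Joining the two parts back together gives the original hash, so the pair still uniquely identifies the record.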

The overall dataset could become large, on the order of hundreds of thousands of records.

I am aware that atomic operations can more easily be done across entries in the same Partition; this is not a concern for me.

Is this Key-splitting approach worth considering? Or should I simply go for the simpler approach and have the Partition Key use the entire MD5 hash, with an empty Row Key?

Solution

Both of your approaches are fine. Basically, 4,096 partitions are sufficient for scaling; if you want even better scalability, use the full MD5 as the partition key, since you don't need atomic operations within a partition. Please note that the row key is a required property and can't be null, so if you go that route, consider using a constant string or the same value as the partition key (the full MD5) for it.
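As a rough sketch of the second option, the keys for an entity could be built like this. It only constructs the `PartitionKey` and `RowKey` properties as a plain dictionary and makes no calls to any storage SDK; the helper name `full_hash_keys` and the constant `CONSTANT_ROW_KEY` are hypothetical and chosen purely for illustration.

```python
import hashlib

# Hypothetical constant used as the row key when every partition holds one row.
CONSTANT_ROW_KEY = "0"


def full_hash_keys(data: bytes, reuse_hash_as_row_key: bool = False) -> dict:
    """Build key properties that use the full MD5 hex digest as the partition key.

    The row key is either a constant string or a copy of the digest,
    matching the two alternatives suggested above.
    """
    digest = hashlib.md5(data).hexdigest()
    row_key = digest if reuse_hash_as_row_key else CONSTANT_ROW_KEY
    return {"PartitionKey": digest, "RowKey": row_key}


entity_constant = full_hash_keys(b"hello")
entity_duplicate = full_hash_keys(b"hello", reuse_hash_as_row_key=True)
```

Either way, a lookup needs both keys, so a constant row key keeps point queries slightly simpler because only the hash has to be computed.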