Search code examples
amazon-dynamodbamazon-dynamodb-index

Storage cost / supportability / performance tradeoffs using compact attributes in DynamoDB


I'm working on large scale component that generates unique/opaque tokens representing business entities. Over time there will be many billions of these records, but for the first year we're not expecting growth to exceed more than 2 billion individual items (probably less than 500 million).

The system itself is horizontally scaled but needs token generation to be idempotent; data integrity is maintained by using a contained but reasonably complex combination of transactional writes with embedded condition expressions AND standalone condition check write items.

The tokens themselves are UUIDs, and 'being efficient' are persisted as Binary attribute values (16 bytes) rather than the string representation (36 bytes), however the downside is that the data doesn't visualise nicely in query consoles making support hard if we encounter any bugs and/or broken data. Note there is no extra code complexity since we implement attributevalue.Marshaler interface to bind UUID (language) types to DynamoDB Binary attributes, and similarly do the same for any composite attributes.

My question relates to (mostly) data size/saving. Since the tokens are the partition keys, and some mapping columns are [token] -> [other token composite attributes], for example two UUIDs concatenated together into 32 bytes.

I wanted to keep really tight control over storage costs knowing that, over time, we will be spending ~$0.25/GB per month for this. My question is really three parts:

  1. Are the PK/SK index size 'reserved' (i.e. padded) so it would make no difference at all to storage cost if we compress the overall field sizes down to the minimum possible size? (... I read somewhere that 100 bytes is typically reserved.

If they ARE padded, the cost savings for the data would be reasonably high, because each (tree) index node will be nearly as big as the data being mapped. (I assume a tree index is used once hashed PK has routed the query to the right server node/disk etc.)

  1. Is there any observable query time performance benefit to compacting 36 bytes into 16 (beyond saving a few bytes across the network)? i.e. if Dynamo has to read fewer pages it'll work faster, but in practice are we talking microseconds at best?

This is a secondary concern, but is worth considering if there is a lot of concurrent access to the data. UUIDs will distribute partitions but inevitably sometimes we will have some more active partitions than others.

  1. Are there any tools that can parse bytes back into human-readable UUIDs (or that we customise to inject behaviour to do this)?

This is concern, because making things small and efficient is ok, but supporting and resolving data issues will be difficult without significant tooling investment, and (unsurprisingly) the DynamoDB console, DynamoDB IntelliJ plugin and AWS NoSQL Workbench all garble the binary into unreadable characters.


Solution

  • No, the PK/SK types are not padded. There's 100 bytes of overhead per item stored.

    Sending less data certainly won't hurt your performance. Don't expect a noticeable improvement though. If shorter values can keep your items at 1,024 bytes instead of 1,025 bytes then you save yourself a Write Unit during the save.

    For the "garbled" binary values I assume you're looking at the base64 encoded values, which is a standard binary encoding standard which can be reversed by lots of tooling (now that you know the name of it).