I am trying to understand the discrepancy between the size of my raw CSV file in S3 and the size of the Neptune storage volume once it is loaded. I am testing a small percentage of my original graph (~15%, vertices only). The raw CSV is 3.1GB (no compression), but once loaded into Neptune it appears to take 59.6GB. I understand that 10GB of storage is added dynamically, but even so, 50GB+ seems excessive given my initial dataset. This is a brand new cluster.
For my test, I have just 4 properties (2 strings, 2 integers), all with single cardinality. I have 90 million vertices (no edges, I am only testing the delta in volume). My true scenario is 600+ million vertices and probably 2x that number of edges. When we load the entire dataset, we approach 2TB of data and performance issues start to arise (queries have to go to the storage volume rather than the cache).
Is there documentation, similar to what DynamoDB has, around the size estimates when it comes to properties, etc.? I want to take these into account when designing a new data model or data fetching strategy.
Dynamo link: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/CapacityUnitCalculations.html
Thanks!
It is expected that the size of the graph data in the storage volume will be larger than the size of the CSV file. Neptune automatically maintains three indices over the data by default. A fourth index can also be optionally enabled. There are also additional data structures maintained to help with efficiently storing and looking up data.
As to the concern about the size on disk and the memory of the instances, keep in mind that Neptune instances only cache in memory the data needed to answer the queries they are sent. Neptune does not need, or try, to load an entire index into memory ahead of time. Only the parts needed to answer a query are fetched and cached. The query engine decides which parts of which indices it needs as part of query planning, optimization and execution.
The exact amount of storage used once a CSV has been loaded will vary depending on things such as the types of the properties and whether or not the fourth index has been enabled. It is hard to provide a precise formula, but you should expect the storage used to be considerably larger than the size of the CSV file.
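Since there is no precise formula, one rough way to size the full graph is to scale up from a measured sample load. The following is a minimal Python sketch using the figures from the question. It assumes storage grows roughly linearly with vertex count, which is my assumption rather than anything Neptune documents, and it ignores edges, Streams and the optional fourth index. The function name is hypothetical.

```python
def estimate_storage_gb(sample_storage_gb, sample_vertices, target_vertices):
    """Scale a measured sample load up to a target vertex count.

    Assumption: storage grows linearly with the number of vertices.
    """
    gb_per_vertex = sample_storage_gb / sample_vertices
    return gb_per_vertex * target_vertices


# Figures reported in the question (vertices only, 4 properties each).
sample_csv_gb = 3.1
sample_storage_gb = 59.6
sample_vertices = 90_000_000
target_vertices = 600_000_000

# How much larger the loaded data is than the raw CSV.
expansion = sample_storage_gb / sample_csv_gb

# Storage per vertex, treating 1 GB as 10^9 bytes (assumption).
bytes_per_vertex = sample_storage_gb * 1e9 / sample_vertices

estimate = estimate_storage_gb(sample_storage_gb, sample_vertices, target_vertices)

print(f"expansion factor: {expansion:.1f}x")
print(f"bytes per vertex: {bytes_per_vertex:.0f}")
print(f"estimated storage for {target_vertices:,} vertices: {estimate:.0f} GB")
```

With these inputs the sample works out to roughly 19x the CSV size and about 660 bytes per vertex, so the vertices alone would land near 400GB at 600 million. Treat that as an order-of-magnitude guide only, and re-measure with a sample that includes edges.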
Note that if you enable the Neptune Streams feature, that will also take up additional storage so that the stream is persisted.
UPDATED 2022-01-07
I should have added this link in the original answer. It points to documentation that explains in more detail how data is stored by Neptune.
As to the buffer pool cache on each instance, a substantial part of the instance memory is dedicated to that cache. When the cache is cold you will see the BufferCacheHitRatio CloudWatch metric dip below roughly 99.9%. That is an indicator that some of the required data was not in the cache and had to be fetched from the storage volume. As the cache warms up, that metric should stay right around the 99.9% range unless you fill the cache and some older pages have to be evicted, or your queries touch data not previously touched.
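If you want to watch for those dips programmatically, something along these lines may help. It is a sketch, not a definitive implementation: it assumes boto3 is installed and credentials are configured, that the metric is published under the AWS/Neptune namespace with a DBInstanceIdentifier dimension, and the helper names and the instance identifier are hypothetical. The 99.9 threshold is the figure mentioned above.

```python
from datetime import datetime, timedelta, timezone


def buffer_cache_request(instance_id, hours=1, period_seconds=300, now=None):
    """Build the arguments for CloudWatch get_metric_statistics."""
    now = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Neptune",
        "MetricName": "BufferCacheHitRatio",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": period_seconds,
        "Statistics": ["Average"],
    }


def cold_periods(datapoints, threshold=99.9):
    """Return (timestamp, average) pairs below the threshold, oldest first."""
    return sorted(
        (p["Timestamp"], p["Average"])
        for p in datapoints
        if p["Average"] < threshold
    )


def fetch_cold_periods(instance_id):
    """Query CloudWatch for one instance and report cold-cache periods."""
    import boto3  # third-party; imported lazily so the helpers above work without it

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**buffer_cache_request(instance_id))
    return cold_periods(response["Datapoints"])
```

Calling `fetch_cold_periods("my-neptune-instance")` (a placeholder identifier) for each instance separately makes sense here, because each instance has its own cache.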
If you know the parts of the graph you expect to touch often you can certainly run queries that warm up the cache. Be aware that each instance (Primary and any read replicas) maintains its own unique cache based on the queries it has seen. So you may want to direct certain queries to specific instances if you have more than just a Primary instance (which is recommended for HA).
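As an illustration of warming each instance separately, here is a minimal Python sketch that posts the same Gremlin queries to every instance endpoint over Neptune's HTTP Gremlin endpoint. The hostnames, the label and the query are placeholders I made up, port 8182 is assumed, and the sketch assumes IAM database authentication is not enabled (signed requests would be needed otherwise).

```python
import json
import urllib.request


def build_gremlin_request(host, query, port=8182):
    """Build an HTTP POST for the Gremlin endpoint of a single instance."""
    body = json.dumps({"gremlin": query}).encode("utf-8")
    return urllib.request.Request(
        f"https://{host}:{port}/gremlin",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_cache(instance_hosts, queries, timeout_seconds=120):
    """Send every warm-up query to every instance, since caches are per instance."""
    for host in instance_hosts:
        for query in queries:
            request = build_gremlin_request(host, query)
            with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
                response.read()  # drain the response; only the side effect matters


# Hypothetical warm-up query touching the part of the graph expected to be hot.
WARMUP_QUERIES = ["g.V().hasLabel('hot_label').limit(1000).valueMap()"]
```

You would call `warm_cache` with the instance endpoints (not the cluster or reader endpoint), so that each instance sees the queries you want it to have cached.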