
AWS Neptune Gremlin query slowness on cold call


I'm currently running some queries with a big performance gap between the first call (up to 2 minutes) and the following ones (around 5 seconds).

This duration difference can be seen through the Gremlin REST API in both execution and profile mode.
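For example, suffixing a traversal with the standard profile() step returns per-step timings, which makes it easy to see where the first call spends its extra time (a minimal sketch, assuming the profile() step is supported by the endpoint):

// Returns per-step metrics instead of results.
g.V(recordIds).hasLabel("record").out('local_id').profile()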

As the query loads a large amount of data, I suspect the issue comes from Neptune's caching behaviour in its default configuration. I was not able to find any way to improve this through configuration, and would be glad to get some advice on reducing the duration of the first call.

Context:

The Neptune database is running on a db.r5.8xlarge instance, and during execution the CPU always stays below 20%. I'm also the only user on this instance during the tests.

As we don't have differential inputs, the database is recreated weekly and switched to production once the loader has finished. Our database therefore has a short lifetime.

The database contains slightly over 1,000,000,000 nodes and far more edges (probably around 10,000,000,000). Those edges are split across 10 edge labels, and most of them are not used in the current query.

Query:

// recordIds is a list of 50 ids.
g.V(recordIds).HasLabel("record")
    // Convert local id to Neptune id.
    .out('local_id')
    // Go to the tree parent link (either the node itself if the edge comes back, or the real parent).
    .bothE('tree_top_parent').inV()
    // Remove duplicates.
    .dedup()
    // Follow the tree parent link backwards to get all children; this step loads a large number of nodes belonging to the same tree.
    .in('tree_top_parent')
    .not(values('some flag').is('Q'))
    // Limit is not reached; the result is between 80k and 100k nodes.
    .limit(200000)
    // Convert back to local id for the 80k to 100k selected nodes.
    .in('local_id')
    .id()

Solution

  • Neptune's architecture consists of a shared cluster "volume" (where all data is persisted and replicated six times across three Availability Zones) and a series of decoupled compute instances (one writer and up to 15 read replicas in a single cluster). No data is persisted on the instances; however, approximately 65% of the memory capacity on an instance is reserved for a buffer pool cache. As data is read from the underlying cluster volume, it is stored in the buffer pool cache until the cache fills. Once the cache fills, a least-recently-used (LRU) eviction policy clears buffer pool cache space for any newer reads.

    It is common to see first reads be slower due to the need to fetch objects from the underlying storage. One can improve this by writing and issuing "prefetch" queries that pull in the objects likely to be needed in the near future, as sketched below.
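    For example, one possible shape for such a prefetch traversal, reusing the labels from the question (a sketch, not a drop-in solution), is to touch the same vertices, edges and property values the production query will need, so that the pages they live on land in the buffer pool cache ahead of time:

        // Warm-up sketch: traverse the same structure as the production query
        // and read the filtered property, then discard the results with count().
        g.V(recordIds).hasLabel("record")
            .out('local_id')
            .bothE('tree_top_parent').inV()
            .dedup()
            .in('tree_top_parent')
            .values('some flag')   // force the property pages to be read
            .count()               // only the reads matter, not the output

    Since the database is rebuilt weekly and only switched to production once the loader has finished, such a warm-up traversal could be issued right after the bulk load completes.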

    If your use case is filling the buffer pool cache and you are constantly seeing buffer pool cache misses (visible through the BufferCacheHitRatio metric in CloudWatch for Neptune), then you may also want to consider using one of the "d" instance types (e.g. r5d.8xlarge) and enabling the Lookup Cache feature [1]. This feature specifically focuses on improving access to property values/literals at query time by keeping them in a directly attached NVMe store on the instance.

    [1] https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-lookup-cache.html