amazon-web-services, gremlin, tinkerpop, amazon-neptune

AWS Neptune Schema Optimization - Billions of nodes and edges


I am creating an AWS Neptune graph that will eventually have billions of nodes and edges. With this kind of data volume, I was wondering whether there are best practices for designing the schema to optimize for queries. One thing in particular I was curious about is whether there is a major performance difference between querying by property and querying by ID:

g.V().has('application', 'applicationId', 'application_123')...

vs.

g.V('application_123')...

I would assume that starting a query with an ID in a graph with billions of nodes and edges would be substantially faster, but I was wondering if anyone has experience with this. If it is the case, I could give my nodes IDs that I know at query time, so that I can always query by ID. For instance, application nodes would have IDs like application_123, and phone nodes would have IDs like phone_1234567890, where (123) 456-7890 is the phone number. Would this improve query performance? Is there anything else I can do to improve query performance on a graph with billions of nodes and edges?
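
For concreteness, this is roughly how I would seed vertices with those domain-derived IDs (the labels and property names here are just illustrative):

// Hypothetical inserts: assign deterministic, domain-derived vertex IDs at write time
g.addV('application').property(id, 'application_123').property('applicationId', 'application_123')
g.addV('phone').property(id, 'phone_1234567890').property('number', '(123) 456-7890')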


Solution

  • In general, when using Amazon Neptune with Gremlin, if you are able to provide your own domain-meaningful IDs for vertices, that will be the most efficient way to look up a specific vertex. Each vertex ID must be unique, so as long as you can meet that constraint in a way that is meaningful for your application, it is a sound approach to take. Looking up vertices by property is still efficient, as property lookups are backed by an index, but using an ID is the most efficient way to find a vertex or set of vertices (see the sketch after this answer).

    It is tricky to give much generic advice about how to model things, as that will in large part depend on the access patterns you need to optimize for, which in turn will inform the choice of data model.
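
    As a concrete illustration, using the IDs from the question (the uses edge label is made up purely to show a traversal continuing from each starting point), the two starting points compare as follows:

    // Most efficient: resolve the vertex directly by its user-supplied ID
    g.V('application_123').out('uses').limit(10)

    // Still efficient (property lookups are index-backed), but goes through the property index first
    g.V().has('application', 'applicationId', 'application_123').out('uses').limit(10)

    Given the same data, both traversals return the same result; the first simply skips the property-index lookup.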