Search code examples
performancequery-optimizationgremlingraph-databasesamazon-neptune

Optimizing Gremlin Query Performance for Filtering and Retrieving Data with many properties


I have a Gremlin query that applies multiple filters to a graph traversal, and I'm seeking advice on optimizing its performance. The query structure is as follows:

g.V()
    .hasLabel('vertex1_label')
    .has('Property1', eq(49))
    .has('Property2', eq(14))
    .as('vertex1')
    .out('edge_json')
    .as('vertexJson_label')
    .select('vertex_label')
    .project('properties', 'edges')
    .by(
        select('vertexJson_label').valueMap()
    )
    .by(
        coalesce(
            outE('edge_label').valueMap().fold(), 
            constant(null)
        )
    ).toList()

In my graph, "vertex1_label" has a large number of properties used for filtering, and it can be connected to other "vertex1_label" instances. Additionally, I introduced "vertexJson_label" to store only a JSON property containing the necessary data from "vertex1_label" to improve query performance. This was necessary because using valueMap() directly on "vertex1_label" was too slow, even when specifying the required properties as parameters. After implementing the JSON solution, the query time was reduced significantly, from 9 seconds to 2.4 seconds, to retrieve 10,000 vertices and 5,000 links from a total of 40,000 vertices in the graph. Also, "edge_label" contains some properties that has to be returned to user. Another condition, can be exist "vertex1_label" without connections.

I'm not an expert in Gremlin, and I'm looking for advice on whether there are ways to further improve query performance as the graph continues to grow. It seems that Amazon Neptune has a "Neptune lookup cache," but I'm not sure if it would benefit my query. Any suggestions or insights would be greatly appreciated.

I've already tried optimizing the query by introducing a 'vertexJson_label' to store a JSON property with the necessary data, which improved performance significantly. Initially, I expected the query to be faster by specifying the required properties with valueMap() directly, but it was slow.

With the JSON solution, the query time improved from 9 seconds to 2.4 seconds for retrieving 10,000 vertices and 5,000 links in a graph with 40,000 vertices. However, as the graph continues to grow, I'm interested in exploring further optimization options to ensure the query's efficiency.

I've heard about Amazon Neptune's 'Neptune lookup cache,' but I'm not sure if it's applicable or beneficial for my use case. Any insights, suggestions, or best practices for Gremlin query optimization in a large graph database would be greatly appreciated.

*******************************************************
                Neptune Gremlin Profile
*******************************************************

Query String
==================
g.V().hasLabel('Location').has('CorporateMember_Id', eq(49)).as('Location').out('LinkedProperty_Json').as('LocationProperty_Json').select('Location').project('properties', 'edges').by(select('LocationProperty_Json').valueMap()).by(coalesce(outE('Linked').valueMap().fold(), constant(null)))

Original Traversal
==================
[GraphStep(vertex,
    []), HasStep([~label.eq(Location), CorporateMember_Id.eq(49)
    ])@[Location
    ], VertexStep(OUT,
    [LinkedProperty_Json
    ],vertex)@[LocationProperty_Json
    ], SelectOneStep(last,Location,null), ProjectStep([properties, edges
    ],
    [
        [SelectOneStep(last,LocationProperty_Json,null), PropertyMapStep(value)
        ],
        [CoalesceStep([
                [VertexStep(OUT,
                    [Linked
                    ],edge), PropertyMapStep(value), FoldStep
                ],
                [ConstantStep(null)
                ]
            ])
        ]
    ])
]

Optimized Traversal
===================
Neptune steps: [
    NeptuneGraphQueryStep(Vertex)@[LocationProperty_Json
    ] {
        JoinGroupNode {
            PatternNode[(?1, <CorporateMember_Id>, ?13, ?) . project distinct ?1 . ContainsFilter(?13 in (49^^<SBYTE>,
                49^^<SHORT>,
                49^^<INT>,
                49^^<LONG>,
                49.0^^<FLOAT>,
                49.0^^<DOUBLE>)) .
            ],
            {estimatedCardinality=18048, expectedTotalInput=-1, expectedTotalOutput=18048, indexTime=0, joinTime=35, numSearches=1, actualTotalOutput=10574
            }
            PatternNode[(?1, <~label>, ?2=<Location>, <~>) . project ask .
            ],
            {estimatedCardinality=81004, expectedTotalInput=18048, joinRatio=1.0, expectedTotalOutput=18048, indexTime=12, joinTime=123, numSearches=11, actualTotalOutput=10574
            }
            PatternNode[(?1, ?5=<LinkedProperty_Json>, ?3, ?6) . project ?1,?3 . IsEdgeIdFilter(?6) .
            ],
            {estimatedCardinality=75988, expectedTotalInput=18048, indexTime=19, joinTime=176, numSearches=11
            }
        },
        {path=[Vertex(?1):GraphStep@[Location
                ], Vertex(?3):VertexStep@[LocationProperty_Json
                ], Vertex(?1):SelectOneStep
            ], joinStats=true, optimizationTime=3, maxVarId=14, executionTime=2579
        }
    },
    NeptuneTraverserConverterStep
]
+ not converted into Neptune steps: ProjectStep([properties, edges
],
[
    [SelectOneStep(last,LocationProperty_Json,null), PropertyMapStep(value)
    ],
    [CoalesceStep([
            [VertexStep(OUT,
                [Linked
                ],edge), PropertyMapStep(value), NeptuneMemoryTrackerStep, FoldStep
            ],
            [ConstantStep(null)
            ]
        ])
    ]
]),

WARNING: >> [ProjectStep([properties, edges
    ],
    [
        [SelectOneStep(last,LocationProperty_Json,null), PropertyMapStep(value)
        ],
        [CoalesceStep([
                [VertexStep(OUT,
                    [Linked
                    ],edge), PropertyMapStep(value), NeptuneMemoryTrackerStep, FoldStep
                ],
                [ConstantStep(null)
                ]
            ])
        ]
    ])
] << (or one of the children for each step) is not supported natively yet

Physical Pipeline
=================
NeptuneGraphQueryStep@[LocationProperty_Json
]
    |-- StartOp
    |-- JoinGroupOp@5344294c
        |-- DynamicJoinOp@54e43dbc(PipelineJoinOp(PatternNode[(?1, <CorporateMember_Id>, ?13, ?) . project distinct ?1 . ContainsFilter(?13 in (49^^<SBYTE>,
    49^^<SHORT>,
    49^^<INT>,
    49^^<LONG>,
    49.0^^<FLOAT>,
    49.0^^<DOUBLE>)) .
],
{estimatedCardinality=18048, expectedTotalInput=-1, expectedTotalOutput=18048
}))
        |-- SpoolerOp(1000, @54e43dbc, null)
        |-- DynamicJoinOp@4498cbbe(PipelineJoinOp(PatternNode[(?1, <~label>, ?2=<Location>, <~>) . project ask .
],
{estimatedCardinality=81004, expectedTotalInput=18048, joinRatio=1.0, expectedTotalOutput=18048
}))
        |-- SpoolerOp(1000, @4498cbbe, @54e43dbc)
        |-- DynamicJoinOp@7fe16b82(PipelineJoinOp(PatternNode[(?1, ?5=<LinkedProperty_Json>, ?3, ?6) . project ?1,?3 . IsEdgeIdFilter(?6) .
],
{estimatedCardinality=75988, expectedTotalInput=18048
}))

Runtime (ms)
============
Query Execution: 2579.766
Serialization: 222.424

Traversal Metrics
=================
Step                                                               Count  Traversers       Time (ms)    % Dur
-------------------------------------------------------------------------------------------------------------
NeptuneGraphQueryStep(Vertex)@[LocationProperty...                 10574       10574         400.334    15.79
NeptuneTraverserConverterStep                                      10574       10574          77.076     3.04
ProjectStep([properties, edges
    ],
    [
        [SelectOneStep...                 10574       10574        2057.792    81.17
  SelectOneStep(last,LocationProperty_Json,null)                   10574       10574          51.849
  PropertyMapStep(value)                                           10574       10574         639.459
  CoalesceStep([
                [VertexStep(OUT,
                    [Linked
                    ],edge),...                 10574       10574        1118.509
    VertexStep(OUT,
                    [Linked
                    ],edge)                                   5001        5001         662.831
    PropertyMapStep(value)                                          5001        5001         330.494
    NeptuneMemoryTrackerStep                                        5001        5001          19.325
    FoldStep                                                       10574       10574          29.690
    ConstantStep(null)                                                                         0.000
                                            >TOTAL                     -           -        2535.204        -

Predicates
==========
# of predicates: 85

Results
=======
Count: 10574
Response serializer: application/vnd.gremlin-v3.0+gryo
Response size (bytes): 5,
                    238,
                    081


Index Operations
================
Query execution:
    # of statement index ops: 26172
    # of unique statement index ops: 26172
    Duplication ratio: 1.0
    # of terms materialized: 1026
Serialization:
    # of statement index ops: 0
    # of terms materialized: 0

Solution

  • If you notice in the profile output:

    WARNING: >> [ProjectStep([properties, edges
        ],
        [
            [SelectOneStep(last,LocationProperty_Json,null), PropertyMapStep(value)
            ],
            [CoalesceStep([
                    [VertexStep(OUT,
                        [Linked
                        ],edge), PropertyMapStep(value), NeptuneMemoryTrackerStep, FoldStep
                    ],
                    [ConstantStep(null)
                    ]
                ])
            ]
        ])
    ] << (or one of the children for each step) is not supported natively yet
    

    This means that, starting with the project() step, everything after is not being executed using Neptune-specific query operators and falling back to using TinkerPop operators. It is often desirable to try to rewrite the query in a way that gets executed using 100% Neptune-specific operators. Sometimes it doesn't provide much in terms of performance, but often times it can.

    The reason why it is falling back to TinkerPop is because something inside of the project() step is using a Gremlin step that isn't fully optimized by Neptune [1]. In this particular instance, it is the coalesce() step causing the issue.

    If you don't need an explicit null in the results, I would suggest just getting rid of that and having the query return an empty list in the second by-modulator of the project. Maybe include the full valueMap with id and/or label to ensure for situations where you are getting an actual empty list versus an empty list due to the edges not having properties:

        .project('properties', 'edges')
        .by(
            select('vertexJson_label').valueMap()
        )
        .by(
            outE('edge_label').valueMap(true).fold()
        ).toList()
    

    I did some testing with a sample dataset with 5,000 vertices with random incoming edges and found that by making the change above, it improved the query performance by over two-fold.

    Note that coalesce() is converted to Neptune-specific operators (as of the time of this post) only when used via the "upsert pattern" for writing new data.

    [1] https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-step-support.html