graph hbase gremlin janusgraph gremlinpython

Gremlin - optimize query


I have a graph that represents database objects, parent-child relations, and dataflow relations (only between columns).

Here is my current Gremlin query (in Python), which should find the dataflow impact of a column:

g.V().has('fqn', 'some fully qualified name').
repeat(outE("flows_into").dedup().store('edges').inV()).
until(
    or_(
        outE("flows_into").count().is_(eq(0)),
        cyclicPath(),
    )
).
cap('edges').
unfold().
dedup().
map(lambda: "g.V(it.get().getVertex(0).id()).in('child').in('child').id().next().toString() + ',' + g.V(it.get().getVertex(1).id()).in('child').in('child').id().next().toString()").
toList()

This query should return all edges that are impacted by the initial column. In some cases, however, I do not care about the column-level detail and want the edges at the 'schema level'. That is what the lambda does: for both vertices of the edge, it traverses up the object tree twice, which returns the schema node.

The problem is in this lambda function. I cannot just do this:

it.get().getVertex(1).in('child').in('child').id().next().toString()

because getVertex(1) does not return a traversable instance, so I need to start a new traversal with g.V().... From my debugging, this line causes a severe slowdown: the query gets about 50x slower if I leave this transformation in.

Do you have any ideas how to optimize this query?


Solution

  • Consider not using a lambda at all, since lambdas tend not to be portable between implementations. The map step could perhaps be replaced with a project step, something like:

    project('v0','v1').
      by(outV().in('child').in('child').id()).
      by(inV().in('child').in('child').id())
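
To make the intent of the project step concrete, here is a minimal plain-Python sketch of what it computes, run against a hypothetical in-memory graph rather than JanusGraph. The data (`PARENT`, `FLOWS_INTO`) and the helper names are assumptions made up for illustration. The sketch mirrors the intent of the query (collect reachable flows_into edges, then lift both endpoints two 'child' hops up to the schema), not its exact path semantics, and the final deduplication of schema pairs is an addition that the original query does not perform.

```python
from collections import deque

# Hypothetical toy data: child -> parent links (the inverse of the 'child' edge).
PARENT = {
    "s1.t1.a": "s1.t1", "s1.t1": "s1",
    "s2.t2.b": "s2.t2", "s2.t2": "s2",
    "s3.t3.c": "s3.t3", "s3.t3": "s3",
}

# Hypothetical 'flows_into' edges between columns, including a cycle b -> a.
FLOWS_INTO = {
    "s1.t1.a": ["s2.t2.b"],
    "s2.t2.b": ["s3.t3.c", "s1.t1.a"],
    "s3.t3.c": [],
}


def reachable_edges(start):
    """Collect every flows_into edge reachable from start, expanding each column once."""
    seen, edges, queue = {start}, [], deque([start])
    while queue:
        src = queue.popleft()
        for dst in FLOWS_INTO.get(src, []):
            edges.append((src, dst))
            if dst not in seen:  # do not re-expand columns, so cycles terminate
                seen.add(dst)
                queue.append(dst)
    return edges


def schema_of(column):
    """Walk two parent hops: column -> table -> schema."""
    return PARENT[PARENT[column]]


def schema_level_edges(start):
    """Map each column edge to its (source schema, target schema) pair, deduplicated in order."""
    pairs = [(schema_of(src), schema_of(dst)) for src, dst in reachable_edges(start)]
    return list(dict.fromkeys(pairs))


print(schema_level_edges("s1.t1.a"))
```

In gremlinpython itself, the project step above would be written with anonymous traversals and trailing underscores on the reserved names, for example `by(__.outV().in_('child').in_('child').id_())`, so the server does the two-hop lookup inside the same traversal instead of starting a new `g.V(...)` per edge.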