I have a graph that represents database objects, parent-child relations, and dataflow relations (the latter only between columns).
Here is my current Gremlin query (in Python), which should find the dataflow impact of a column:
g.V().has('fqn', 'some fully qualified name').
  repeat(outE('flows_into').dedup().store('edges').inV()).
  until(
    or_(
      outE('flows_into').count().is_(eq(0)),
      cyclicPath(),
    )
  ).
  cap('edges').
  unfold().
  dedup().
  map(lambda: "g.V(it.get().getVertex(0).id()).in('child').in('child').id().next().toString() + ',' + g.V(it.get().getVertex(1).id()).in('child').in('child').id().next().toString()").
  toList()
This query should return all edges that are somehow impacted by the initial column. The problem is that in some cases I do not care about the column-level detail and want the edges at 'schema level' instead. That is what the lambda does: for both vertices of each edge, it traverses two steps up the object tree, which yields the schema node.
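For illustration, here is a toy model of that rollup in plain Python (the three-level schema → table → column hierarchy and all names here are assumptions for the sketch, not the actual data model):

```python
# Toy model of the object tree: a 'child' edge points from parent to child,
# so going "up" twice from a column reaches its schema.
# All identifiers (schema1, tableA, col_x, ...) are hypothetical.
parent = {
    "col_x": "tableA", "col_y": "tableB",      # column -> table
    "tableA": "schema1", "tableB": "schema2",  # table -> schema
}

def schema_of(column):
    """Two hops up the containment tree: column -> table -> schema."""
    return parent[parent[column]]

# A column-level dataflow edge rolled up to schema level:
edge = ("col_x", "col_y")
print(schema_of(edge[0]) + "," + schema_of(edge[1]))  # schema1,schema2
```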
The problem is with this lambda: I cannot simply write
it.get().getVertex(1).in('child').in('child').id().next().toString()
because getVertex(1) does not return a traversable instance, so I need to start a new traversal with g.V(). From my debugging, this transformation is what causes the horrible slowdown: the query runs about 50x slower with it left in.
Do you have any ideas how to optimize this query?
You might consider not using a lambda at all, given that they tend not to be portable between Gremlin implementations. Perhaps the map step could be replaced with a project() step, something like:
project('v0', 'v1').
  by(outV().in('child').in('child').id()).
  by(inV().in('child').in('child').id())
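To see the shape of the whole computation, here is a plain-Python sketch of the traversal plus the schema-level rollup (a toy in-memory graph with hypothetical names; in reality this work happens server-side inside the Gremlin traversal):

```python
# Toy simulation of the repeat/until dataflow walk plus schema-level rollup.
# flows_into maps a column to the columns it feeds; parent encodes containment.
# All identifiers are hypothetical.
flows_into = {"col_a": ["col_b"], "col_b": ["col_c"], "col_c": []}
parent = {
    "col_a": "t1", "col_b": "t2", "col_c": "t2",  # column -> table
    "t1": "s1", "t2": "s2",                       # table -> schema
}

def schema_of(column):
    """Two hops up the containment tree: column -> table -> schema."""
    return parent[parent[column]]

def impacted_schema_edges(start):
    """Walk flows_into transitively, collect the traversed edges, roll
    each endpoint up two levels, and dedup the schema-level pairs."""
    seen_edges, stack = set(), [start]
    while stack:
        col = stack.pop()
        for target in flows_into.get(col, []):
            edge = (col, target)
            if edge in seen_edges:  # dedup doubles as a cycle guard
                continue
            seen_edges.add(edge)
            stack.append(target)
    return sorted({(schema_of(a), schema_of(b)) for a, b in seen_edges})

print(impacted_schema_edges("col_a"))  # [('s1', 's2'), ('s2', 's2')]
```

The key point the project() rewrite exploits is that the rollup stays inside the one server-side traversal instead of spawning a fresh g.V() lookup per edge, which is what made the lambda version so slow.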