Search code examples
javagremlinjanusgraphgremlinpythongremlin-java

Merge Operator in sack() step does not work as expected


I've been working with sack() step in gremlin. I'm trying to perform some operation on sack when two traversers merge. But for some reason the merge doesn't seem to work as expected. To describe my issue, I have created a sample graph.

enter image description here

Steps to reproduce the graph

g.addE("causes").from(coalesce(V().has("Name","m1"),addV().property("Name","m1"))).to(coalesce(V().has("Name","m2"),addV().property("Name","m2")));
g.addE("causes").from(coalesce(V().has("Name","m1"),addV().property("Name","m1"))).to(coalesce(V().has("Name","m3"),addV().property("Name","m3")));
g.addE("causes").from(coalesce(V().has("Name","m1"),addV().property("Name","m1"))).to(coalesce(V().has("Name","m4"),addV().property("Name","m4")));
g.addE("causes").from(coalesce(V().has("Name","m2"),addV().property("Name","m2"))).to(coalesce(V().has("Name","m5"),addV().property("Name","m5")));
g.addE("causes").from(coalesce(V().has("Name","m2"),addV().property("Name","m2"))).to(coalesce(V().has("Name","m6"),addV().property("Name","m6")));
g.addE("causes").from(coalesce(V().has("Name","m3"),addV().property("Name","m3"))).to(coalesce(V().has("Name","m7"),addV().property("Name","m7")));
g.addE("causes").from(coalesce(V().has("Name","m3"),addV().property("Name","m3"))).to(coalesce(V().has("Name","m8"),addV().property("Name","m8")));
g.addE("causes").from(coalesce(V().has("Name","m6"),addV().property("Name","m6"))).to(coalesce(V().has("Name","m9"),addV().property("Name","m9")));
g.addE("causes").from(coalesce(V().has("Name","m7"),addV().property("Name","m7"))).to(coalesce(V().has("Name","m9"),addV().property("Name","m9")));
g.addE("causes").from(coalesce(V().has("Name","m9"),addV().property("Name","m9"))).to(coalesce(V().has("Name","m10"),addV().property("Name","m10")));

I use sack to count the depth traversed and when two traverser meet, I used minus as sack merge operator.

I used the following query, to start traversal from vertices m2 and m3 simultaneously.

gremlin> g.V().has("Name","m2").id();
==>4200
gremlin> g.V().has("Name","m3").id();
==>4208
gremlin> g.withSack(0,minus).V(4200,4208).repeat(optional(outE().sack(sum).by(constant(1)).inV())).as('endnode').times(10).sack().group().by(select('endnode').values('Name')).unfold();
==>m5=[1]
==>m8=[1]
==>m10=[3, 3]

The expected result should have m10=[1, 1] but it is m10=[3, 3]. It seems that the merge operation is not getting triggered.

I tried the same query but from Vertex m1.

gremlin> g.V().has("Name","m1").id();
==>4256
gremlin> g.withSack(0,minus).V(4256).repeat(optional(outE().sack(sum).by(constant(1)).inV())).as('endnode').times(10).sack().group().by(select('endnode').values('Name')).unfold();
==>m4=[1]
==>m5=[2]
==>m8=[2]
==>m10=[1, 1]

With this, I get the expected result m10=[1, 1]. The merge operator is getting triggered.

I am not trying to perform the exact operation with sack. I used this example just to describe my issue. The actual sack operation is little more complex but I am able to reproduce this same issue.

Is there some other thing that need to be included to make this work?

Thanks in advance.


Solution

  • It looks like there is one key difference between the m2+m3 start example and the m1-only start.

    When you only start from m1, as in:

    gremlin> g.V().has("Name","m1").id();
    ==>4256
    gremlin> g.withSack(0,minus).V(4256).repeat(optional(outE().sack(sum).by(constant(1)).inV())).as('endnode').times(10).sack().group().by(select('endnode').values('Name')).unfold();
    

    there is only 1 traverser entering the repeat, in this case, the merge is happening inside the repeat when both paths reach m9. At this point, both sacs have a value of 3, which when subtracted gives 0 for each. Then when they move on to m10, they are both incremented which gives [1, 1].

    In the 2 starting point example:

    gremlin> g.V().has("Name","m2").id();
    ==>4200
    gremlin> g.V().has("Name","m3").id();
    ==>4208
    gremlin> g.withSack(0,minus).V(4200,4208).repeat(optional(outE().sack(sum).by(constant(1)).inV())).as('endnode').times(10).sack().group().by(select('endnode').values('Name')).unfold();
    

    there are 2 distinct traversers entering into the repeat. Each of these invocations of repeat run independently of each other, and there is no bulking/merging between each of these. In this case, the "m2" invocation of repeat runs in full, and finds that the distance from m2 to m10 is 3. Then the "m3" invocation runs and finds the distance from m3 to m10 is also 3. Then these 2 sacs are both accessed and grouped to form [3, 3]. You can force the 2 m10 traversers which exit the repeat to bulk by adding a barrier() step after the repeat. This merges the 2 3's into 0's after the repeat.

    gremlin> g.withSack(0,minus).V(4200,4208).repeat(optional(outE().sack(sum).by(constant(1)).inV())).as('endnode').times(10).barrier().sack().group().by(select('endnode').values('Name')).unfold();
    ==>m5=[1]
    ==>m8=[1]
    ==>m10=[0, 0]
    

    In my opinion, it is expected that there is no bulking/merging across separate iterations of a repeat. In gremlin each execution of a child traversal is generally independent of each other.