Search code examples
filtergremlin

Gremlin Sack Sum Once Per Distinct Value


Referencing this example from Practical Gremlin and this stack overflow post:

Gremlin Post Filter Path

  g.withSack(0).V().
  has('code','AUS').
  repeat(out().simplePath().has('country',within('US','UK')).
         choose(has('code','MAN'),sack(sum).by(constant(1)))).
    until(has('code','EDI')).
  where(sack().is(1)).
  path().by('code').
  limit(10)

Is it possible to perform a sack sum in such a way as to only sum the first time a property is found. For instance, instead of the 'code' property inside of the choose() which will only sum once per 'code' encountered thanks to the simplePath(), what if there was another property called 'airport_color'. As we perform the traversal, I would only want the sack sum to increment the first time it encountered 'blue' or 'white' as an example, even though multiple airports could have the same color as we go through the traversal. This would help me in the where() clause because, if I had a couple of colors I was interested in looking for as an example (maybe blue and white), I could set the where() clause to be equal to two and know that two wasn't arrived at just because I passed through blue twice but because blue and white was encountred.

I tried using aggregation to make the sack sum increment only on the first encounter but couldn't get it to work, something like this:

g.withSack(0).V().
has('code','AUS').
repeat(out().simplePath().has('country',within('US','UK')).
     choose(has('airport_color','blue').has('airport_color', without('airport_color_agg')),sack(sum).by(constant(1))).
     aggregate('airport_color_agg').by('airport_color')).
until(has('code','EDI')).
where(sack().is(1)).
path().by('code').
limit(10)

There could be multiple colors in the choose() via or() but I limited it to just one to keep the example more straightforward.

Thanks for your help!


Solution

  • Using the sample graph below:

    g.addV('A').property(id,'a1').property('color','red').as('a1').
      addV('A').property(id,'a2').property('color','blue').as('a2').
      addV('A').property(id,'a3').property('color','red').as('a3').
      addV('A').property(id,'a4').property('color','yellow').as('a4').
      addV('A').property(id,'a5').property('color','green').as('a5').
      addV('A').property(id,'a6').property('color','blue').as('a6').
      addE('R').from('a1').to('a2').
      addE('R').from('a2').to('a3').
      addE('R').from('a3').to('a4').
      addE('R').from('a4').to('a5').
      addE('R').from('a5').to('a6')
    

    we can inspect the path through the graph as follows:

    g.V('a1').
      repeat(out()).
      until(not(out())).
      path().
        by('color')
    

    which shows us the colors found

    path[red, blue, red, yellow, green, blue]
    

    the next thing we need to do is to remove the duplicates and filter out the colors not in our want list.

    g.withSideEffect('want',['red','green']).
      V('a1').
      repeat(out()).
      until(not(out())).
      path().
        by('color').
      dedup(local).
      unfold().
      where(within('want'))
    

    which gives us:

    red
    green
    

    finally, we just need to count them:

    g.withSideEffect('want',['red','green']).
      V('a1').
      repeat(out()).
      until(not(out())).
      path().
        by('color').
      dedup(local).
      unfold().
      where(within('want')).
      count()
    

    Which, as expected, gives us:

    2
    

    UPDATED 2022-09-01 To reflect discussion in comments.

    To change the query so that only paths that visit each of the required colors at least once are returned, the previous steps leading up to the count need to be turned into a filter.

    g.withSideEffect('want',['red','green']).
      V('a1').
      repeat(out()).
      until(not(out())).
      filter(
        path().
          by('color').
        dedup(local).
        unfold().
        where(within('want')).
        count().is(2)).
      path()
    

    which for our sample graph returns:

    1   path[v[a1], v[a2], v[a3], v[a4], v[a5], v[a6]]
    

    The query as written gets the job done and hopefully is not too hard to follow. There are some things we could change/improve.

    1. Pass in the count for the is step as another parameter like the want list.
    2. Rather than pass in a parameter, use the size of the want list instead. This makes the query a little more complex.
    3. Use a sack as the query proceeds to collect the colors seen. The query gets more complex in that case as, while you can maintain a list in a sack, updating it requires a few extra steps.

    Here is the query rewritten to use a sack. This assumes the start node has a color that should be included. If that is not the case, the first sack(assign) can be removed and a withSack([]) added after the withSideEffect.

    g.withSideEffect('want',['red','green']).
      V('a1').
      sack(assign).by(values('color').fold()).
      repeat(out().sack(assign).by(union(sack().unfold(),values('color')).fold())).
      until(not(out())).
      filter(
        sack().
        unfold().
        dedup().
        where(within('want')).
        count().is(2)).
      path()