gremlin tinkerpop

Is there a way to reuse aggregate steps?


I have a graph database that stores different types of entities, and I am building an API to fetch entities from the graph. It is a bit more complicated than a plain lookup, however: each entity type has its own set of rules for fetching related entities along with the original. To do this, I use an aggregate step to collect all the related entities I am fetching into a single collection.

An added requirement is to fetch a batch of entities (and their related entities). I was going to do this by changing the has step that fetches entities to use P.within and mapping the aggregation over each of the found entities. This works as long as I fetch a single entity. If I fetch two, the result set for the first is correct, but the result set for the second contains the first one's results as well as its own. I think this is because the second one simply adds to the collection aggregated for the first, since the aggregation key is the same. I haven't found a way to clear the collection between the first and the second, nor a way to use a dynamic side-effect key for the aggregation.

Code:


return graph.traversal().V()
    .hasLabel(ENTITY_LABEL)
    .has("entity_ref", P.within(entityRefs)) // entityRefs is a list of entities I am looking for
    .flatMap(
        __.aggregate("match")
          .sideEffect(
                // The logic that applies the rules lives here. It will add items to "match" collection.
          )
          .select("match")
          .fold()
    )
    .toStream()
    ...

The result should be a list of lists of entities where the first list of entities in the outer list contains results for the first entity in entityRefs, and the second list of entities contains results for the second entity in entityRefs.

Example: I want to fetch the vertices for entity refs A and B and their related entities. Let's say I expect the results to be [[A, C], [B, D, E]], but instead I get [[A, C], [A, C, B, D, E]] (the second result also contains the results of the first).

Questions:

  1. Is there a way to clear the "match" collection after the selection?
  2. Is there a way to have dynamic side effect keys such that I create a collection for each entityRef?
  3. Is there perhaps a different way I can do this?
  4. Have I misidentified the problem?

EDIT: Here is a miniature version of the problem. The graph is set up like so:

g.addV('entity').property('id',1).property('type', 'customer').as('1').
  addV('entity').property('id',2).property('type', 'email').as('2').
  addV('entity').property('id',6).property('type', 'customer').as('6').
  addV('entity').property('id',3).property('type', 'product').as('3').
  addV('entity').property('id',4).property('type', 'subLocation').as('4').
  addV('entity').property('id',7).property('type', 'location').as('7').
  addV('entity').property('id',5).property('type', 'productGroup').as('5').
  addE('AKA').from('1').to('2').
  addE('AKA').from('2').to('6').
  addE('HOSTED_AT').from('3').to('4').
  addE('LOCATED_AT').from('4').to('7').
  addE('PART_OF').from('3').to('5').iterate()

I want to fetch a batch of entities by id, together with their related entities. Which related entities are returned depends on the type of the original entity. My current query looks like this (slightly modified for this example):

g.V().
    hasLabel('entity').
    has('id', P.within(1,3)).
    flatMap(
        aggregate('match').
        sideEffect(
            choose(values('type')).
                option('customer',
                    both('AKA').
                        has('type', P.within('email', 'phone')).
                        sideEffect(
                            has('type', 'email').
                                aggregate('match')).
                        both('AKA').
                        has('type', 'customer').
                        aggregate('match')).
                option('product',
                    bothE('HOSTED_AT', 'PART_OF').
                        choose(label()).
                        option('PART_OF',
                            bothV().
                                has('type', P.eq('productGroup')).
                                aggregate('match')).
                        option('HOSTED_AT',
                            bothV().
                                has('type', P.eq('subLocation')).
                                aggregate('match').
                                both('LOCATED_AT').
                                has('type', P.eq('location')).
                                aggregate('match')))
        ).
        select('match').
        unfold().
        dedup().
        values('id').
        fold()
    ).
    toList()

If I fetch only one entity, I get correct results: for id 1 I get [1,2,6], and for id 3 I get [3,5,4,7]. However, when I fetch both, I get:

==>[3,5,4,7]
==>[3,5,4,7,1,2,6]

The first result is correct, but the second contains the results for both ids.


Solution

  • You can use the group().by(key).by(value) traversal step, which is not very well documented but seems powerful.

    That way you can drop the aggregate() side-effect step that is causing you trouble. As an alternative way to collect multiple vertices matching some traversal into a list, I used union().

    An example that uses the graph you posted (I only included the customer option for brevity):

    // Assumes static imports of __.* and P.within; Entity stands in for your own business class.
    // The key type is Object because the sample graph stores 'id' as an integer.
    g.V().
        hasLabel("entity").
        has("id", P.within(1, 3)).
        <Object, List<Entity>>group()
          .by("id")
          .by(choose(values("type"))
            .option("customer", union(
                identity(),
                both("AKA").has("type", "email"),
                both("AKA").has("type", within("email", "phone")).both("AKA").has("type", "customer"))
              .map(traverser -> new Entity(traverser.get())) // Or whatever business class you have
              .fold()) // This is important to collect all 3 paths in the union together
            .option("product", union())) // product rules omitted for brevity
          .next()
        
    

    This traversal has the obvious drawback of being a bit more verbose: it declares the step over 'AKA' from a customer twice, whereas your traversal declared it only once.

    It does, however, keep the by(value) part of the group() step separate for each key, which is what we wanted.
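
To make the difference concrete, here is a rough plain-Java model of the two behaviours. It is not TinkerPop code and does not touch a graph: the RULES table is a hypothetical, hard-coded stand-in for whatever the per-type rules would match, filled with the ids from the sample graph in the question. The only point it illustrates is that a single collection shared by all start vertices (like the 'match' side effect) keeps growing, while a collection per key does not.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SideEffectModel {
    // Hypothetical stand-in for the traversal rules: start id -> ids the rules would match.
    static final Map<Integer, List<Integer>> RULES = new LinkedHashMap<>();
    static {
        RULES.put(3, List.of(3, 5, 4, 7));
        RULES.put(1, List.of(1, 2, 6));
    }

    // Models aggregate('match') followed by select('match'):
    // one list is shared by every start vertex, so later results include earlier ones.
    static List<List<Integer>> sharedAccumulator(List<Integer> startIds) {
        List<Integer> match = new ArrayList<>();
        List<List<Integer>> results = new ArrayList<>();
        for (int id : startIds) {
            match.addAll(RULES.get(id));
            results.add(new ArrayList<>(match)); // snapshot of the shared list so far
        }
        return results;
    }

    // Models group().by(key).by(value): each key gets its own, independent list.
    static Map<Integer, List<Integer>> groupedByKey(List<Integer> startIds) {
        Map<Integer, List<Integer>> results = new LinkedHashMap<>();
        for (int id : startIds) {
            results.put(id, new ArrayList<>(RULES.get(id)));
        }
        return results;
    }

    public static void main(String[] args) {
        List<Integer> startIds = List.of(3, 1); // same visiting order as in the question's output
        System.out.println(sharedAccumulator(startIds)); // [[3, 5, 4, 7], [3, 5, 4, 7, 1, 2, 6]]
        System.out.println(groupedByKey(startIds));      // {3=[3, 5, 4, 7], 1=[1, 2, 6]}
    }
}
```

The first printed line reproduces the unwanted output from the question; the second has the per-entity shape that the group() traversal returns as a map keyed by id.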