Search code examples
gremlintinkerpop3

Gremlin, linking an edge to a vertex via property


In a graph database, I have graphs like:

v1: Protein{prefName: 'QP1'} 
  -- r1: part_of{evidence: 'ns:testdb'} 
  --> v2: Protcmplx{prefName: 'P12 Complex'}
ev: EvidenceType{ iri = "ns:testdb", label = "Test Database" }

I'd like to write a Gremlin query to fetch instances of the part_of relationship and return v1 and v2's prefName, along with the evidence's label. So far I've tried this:

g.V().hasLabel( containing('Protein') ).as('p')
  .outE().hasLabel( 'is_part_of' ).as('pr')
  .inV().hasLabel( containing('Protcmplx') ).as('cpx')
.V().hasLabel( containing('EvidenceType') ).as('ev')
  .has( 'iri', eq( select('pr').by('evidence') ) )
.select( 'p', 'cpx', 'ev', 'pr' )
  .by('prefName')
  .by('prefName')
  .by('label')
  .by('evidence')
.limit(100)

But it takes a lot of time for a few thousand nodes+edeges, and eventually, it doesn't return anything. I'm sure the values are there and I think the problem is with has( 'iri', ... ), but I can't figure out how to match an edge property with another property in an unconnected vertex.

The graph is modelled this way, cause the LPG model doesn't allow for hyper-edges (linking >2 vertices).


Solution

  • I've found a way using where() and by(), but it is quite slow (11secs to get 100 tuples from a few thousands nodes+edges):

    g.V().hasLabel ( containing ( 'Protcmplx' ) ).as ( 'cpx' )
      .inE().hasLabel ( 'is_part_of' ).limit ( 10 ).as ( 'pr' )
      .outV ().hasLabel ( containing ( 'Protein' ) ).as ( 'p' )  
    .V().hasLabel ( containing ( 'EvidenceType' ) ).as ( 'ev' )
        .where ( 'ev', eq ( 'pr' ) ).by ( 'iri' ).by ( 'evidence' ) 
    .select ( 'p', 'cpx', 'ev' )
    .by ( 'prefName' )
    .by ( 'prefName' )
    .by ( 'label' )
    

    Any help with optimisation would be welcome!

    EDIT: following a suggestion from the comments (thanks!), I've rewritten the solution a bit (it's still slow) and used .profile() at the end, obtaining this:

    Traversal Metrics
    Step                                                               Count  Traversers       Time (ms)    % Dur
    =============================================================================================================
    GraphStep(vertex,[])                                              123591      123591         507.179     9.09
    HasStep([~label.containing(Protcmplx)])@[cpx]                         10          10          34.313     0.61
    VertexStep(IN,[is_part_of],edge)@[pr]                                 13          13           5.089     0.09
    RangeGlobalStep(0,10)                                                 10          10           0.094     0.00
    EdgeVertexStep(OUT)                                                   10          10           1.618     0.03
    HasStep([~label.containing(Protein)])@[p]                             10          10           0.065     0.00
    GraphStep(vertex,[])                                             1738360     1738360        4574.578    81.99
    HasStep([~label.containing(EvidenceType)])@[ev]                      510         510         447.546     8.02
    WherePredicateStep(ev,eq(pr),[value(iri), value...                    10          10           6.747     0.12
    NoOpBarrierStep(2500)                                                 10          10           1.444     0.03
    SelectStep(last,[p, cpx, ev],[value(prefName), ...                    10          10           0.154     0.00
    NoOpBarrierStep(2500)                                                 10           8           0.785     0.01
                                                >TOTAL                     -           -        5579.617        -
    

    So, the problem seems to be that the second V() picks up all the vertexes before the filters from the former traversal (on the where) can be applied. However, I can't find a way to avoid this. Does Gremlin have subqueries?

    EDIT/2: inspired by the suggestion in the comments to use two separated queries (thanks!), I've tried this:

    evLabels = [:]
    g.V().hasLabel ( containing ( 'Protcmplx' ) ).as ( 'cpx' )
      // Trying to put the limit early-on
      .inE().hasLabel ( 'is_part_of' ).limit ( 100 ).as ( 'pr' )
      .outV ().hasLabel ( containing ( 'Protein' ) ).as ( 'p' )
    .select ( 'p', 'cpx', 'pr' )
      .by ( 'prefName' )
      .by ( 'prefName' )
      .by ( map{
        pr = it.get()
        evIri = pr.values ( 'evidence' ).next ();
        lbl = evLabels [ evIri ];
        if ( lbl != null ) return lbl;
        lbl = g.V().hasLabel ( containing ( 'EvidenceType' ) )
                 .has ( 'iri', evIri )
                 .values ( 'label' ).next ();
        evLabels [ evIri ] = lbl == null ? "" : lbl;
        return lbl;
      })
    

    Which avoids a full cartesian product join by accumulating sub-query results into a map. This is much faster than the original query (like <1s for 100 edges), but not very simple to read, I'm sure there is a better way to write the same.