Search code examples
graph-databasesgremlin

gremlin syntax to calculate Jaccard similarity metric


I'm interested in calculating the Jaccard similarity metric for all pairs of vertices in a graph that are not directly connected. The Jaccard metric is defined as the norm of the intersection of the neighbors of the two vertices divided by the norm of the union of the same sets.

where

So far I have been able to get all pairs of nodes not directly connected (only interested in this case for link prediction, if a direct link already exists then I do not need to calculate the Jaccard metric) such that for a pair (x, y) where x not equals y:

g.V().as('v1').V().where(neq('v1')).as('v2').filter(__.not(inE().where(outV().as('v1'))))

In addition to that I also include the neighbors for each pair member labeled v1out and v2out:

g.V().as('v1').out().as('v1out').V().where(neq('v1')).as('v2').filter(__.not(inE().where(outV().as('v1')))).out().as('v2out')

From here how would I perform the set operations to get the number of elements in the intersection and union of the two neighbor sets? After that I believe I can append the math step (currently using TinkerPop 3.4.0) to calculate the Jaccard similarity ratio followed by a choose step to add an edge when the value is greater than a threshold. If a completely different approach has benefits over the partial solution above I would be happy to adopt it and finally get this working.


Solution

  • Let's do it step by step:

    Find pairs of vertices and also collect their respective neighbors:

    g.V().match(
          __.as('v1').out().dedup().fold().as('v1n'),
          __.as('v1').V().as('v2'),
          __.as('v2').out().dedup().fold().as('v2n')).
        where('v1', neq('v2'))
    

    Make sure that v1 is not a neighbor of v2 and vice versa:

    g.V().match(
          __.as('v1').out().dedup().fold().as('v1n'),
          __.as('v1').V().as('v2'),
          __.as('v2').out().dedup().fold().as('v2n')).
        where('v1', neq('v2').and(without('v2n'))).
        where('v2', without('v1n'))
    

    Next, compute the number of intersecting neighbors and the total number of neighbors:

    g.V().match(
          __.as('v1').out().dedup().fold().as('v1n'),
          __.as('v1').V().as('v2'),
          __.as('v2').out().dedup().fold().as('v2n')).
        where('v1', neq('v2').and(without('v2n'))).
        where('v2', without('v1n')).as('m').
      project('v1','v2','i','u').
        by(select('v1')).
        by(select('v2')).
        by(select('v1n').as('n').
           select('m').
           select('v2n').unfold().
             where(within('n')).
           count()).
        by(union(select('v1n'),
                 select('v2n')).unfold().
           dedup().count())
    

    And finally, compute the Jaccard similarity by dividing i by u (also make sure that vertices without neighbors get filtered out to prevent divisions by 0):

    g.V().match(
          __.as('v1').out().dedup().fold().as('v1n'),
          __.as('v1').V().as('v2'),
          __.as('v2').out().dedup().fold().as('v2n')).
        where('v1', neq('v2').and(without('v2n'))).
        where('v2', without('v1n')).as('m').
      project('v1','v2','i','u').
        by(select('v1')).
        by(select('v2')).
        by(select('v1n').as('n').
           select('m').
           select('v2n').unfold().
             where(within('n')).
           count()).
        by(union(select('v1n'),
                 select('v2n')).unfold().
           dedup().count()).
      filter(select('u').is(gt(0))).
      project('v1','v2','j').
        by(select('v1')).
        by(select('v2')).
        by(math('i/u'))
    

    One last thing: Since comparing vertex v1 and v2 is the same as comparing v2 and v1, the query only needs to consider one case. One way to do that is by making sure that v1's id is smaller than v2's id:

    g.V().match(
          __.as('v1').out().dedup().fold().as('v1n'),
          __.as('v1').V().as('v2'),
          __.as('v2').out().dedup().fold().as('v2n')).
        where('v1', lt('v2')).
          by(id).
        where('v1', without('v2n')).
        where('v2', without('v1n')).as('m').
      project('v1','v2','i','u').
        by(select('v1')).
        by(select('v2')).
        by(select('v1n').as('n').
           select('m').
           select('v2n').unfold().
             where(within('n')).
           count()).
        by(union(select('v1n'),
                 select('v2n')).unfold().
           dedup().count()).
      filter(select('u').is(gt(0))).
      project('v1','v2','j').
        by(select('v1')).
        by(select('v2')).
        by(math('i/u'))
    

    Executing this traversal over the modern toy graph yields the following result:

    gremlin> g = TinkerFactory.createModern().traversal()
    ==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
    gremlin> g.V().match(
    ......1>       __.as('v1').out().dedup().fold().as('v1n'),
    ......2>       __.as('v1').V().as('v2'),
    ......3>       __.as('v2').out().dedup().fold().as('v2n')).
    ......4>     where('v1', lt('v2')).
    ......5>       by(id).
    ......6>     where('v1', without('v2n')).
    ......7>     where('v2', without('v1n')).as('m').
    ......8>   project('v1','v2','i','u').
    ......9>     by(select('v1')).
    .....10>     by(select('v2')).
    .....11>     by(select('v1n').as('n').
    .....12>        select('m').
    .....13>        select('v2n').unfold().
    .....14>          where(within('n')).
    .....15>        count()).
    .....16>     by(union(select('v1n'),
    .....17>              select('v2n')).unfold().
    .....18>        dedup().count()).
    .....19>   filter(select('u').is(gt(0))).
    .....20>   project('v1','v2','j').
    .....21>     by(select('v1')).
    .....22>     by(select('v2')).
    .....23>     by(math('i/u'))
    ==>[v1:v[1],v2:v[5],j:0.0]
    ==>[v1:v[1],v2:v[6],j:0.3333333333333333]
    ==>[v1:v[2],v2:v[4],j:0.0]
    ==>[v1:v[2],v2:v[6],j:0.0]
    ==>[v1:v[4],v2:v[6],j:0.5]
    ==>[v1:v[5],v2:v[6],j:0.0]