Search code examples
gremlintinkerpop3amazon-neptunegremlinpython

Neptune - How to get distance to all nodes with proportional weights gremlin


I'm having difficult time figuring out query in gremlin for the following scenario. Here is the the directed graph (may be cyclic).

enter image description here

I want to get top N favorable nodes, starting from node "Jane", where favor is defined as:

favor(Jane->Lisa) = edge(Jane,Lisa) / total weight from outwards edges of Lisa
favor(Jane->Thomas) = favor(Jane->Thomas) + favor(Jane->Lisa) * favor(Lisa->Thomas)

favor(Jane->Jerryd) = favor(Jane->Thomas) * favor(Thomas->Jerryd) + favor(Jane->Lisa) * favor(Lisa->Jerryd)

favor(Jane->Jerryd) = [favor(Jane->Thomas) + favor(Jane->Lisa) * favor(Lisa->Thomas)] * favor(Thomas->Jerryd) + favor(Jane->Lisa) * favor(Lisa->Jerryd)


and so .. on

Here is same graph with hand calculation of what I mean,

enter image description here

This is fairly simple to transferse with programming but I'm not sure, how ecactly to query it with gremlin or even sparql.

Here is the query to create this example graph:

g
.addV('person').as('1').property(single, 'name', 'jane')
.addV('person').as('2').property(single, 'name', 'thomas')
.addV('person').as('3').property(single, 'name', 'lisa')
.addV('person').as('4').property(single, 'name', 'wyd')
.addV('person').as('5').property(single, 'name', 'jerryd')
.addE('favor').from('1').to('2').property('weight', 10)
.addE('favor').from('1').to('3').property('weight', 20)
.addE('favor').from('3').to('2').property('weight', 90)
.addE('favor').from('2').to('4').property('weight', 50)
.addE('favor').from('2').to('5').property('weight', 90)
.addE('favor').from('3').to('5').property('weight', 100)

All I'm looking for is:

[Lisa, computedFavor]
[Thomas, computedFavor]
[Jerryd, computedFavor]
[Wyd, computedFavor]

I'm struggling to incooperate cyclic graph to adjust weight. This is where I've been able to query so far: https://gremlify.com/f2r0zy03oxc/2

g.V().has('name','jane').       // our starting node
   repeat(                      
      union(                    
         outE()                 // get only outwards edges
      ).
      otherV().simplePath()).   // produce simple path
   emit().  
   times(10).                   // max depth of 10
   path().                      // attain path
   by(valueMap())

Addressing Comments from stephen mallette:

favor(Jane->Jerryd) = 
    favor(Jane->Thomas) * favor(Thomas->Jerryd) 
  + favor(Jane->Lisa) * favor(Lisa->Jerryd)

// note we can expand on favor(Jane->Thomas) in above expression
// 
// favor(Jane->Thomas) is favor(Jane->Thomas)@directEdge +
//                        favor(Jane->Lisa) * favor(Lisa->Thomas)
//

Calculation Example

Jane to Lisa                   => 20/(10+20)         => 2/3
Lisa to Jerryd                 => 100/(100+90)       => 10/19
Jane to Lisa to Jerryd         => 2/3*(10/19)

Jane to Thomas (directly)      => 10/(10+20)         => 1/3
Jane to Lisa to Thomas         => 2/3 * 90/(100+90)  => 2/3 * 9/19
Jane to Thomas                 => 1/3 + (2/3 * 9/19)

Thomas to Jerryd               => 90/(90+50)         => 9/14
Jane to Thomas to Jerryd       => [1/3 + (2/3 * 9/19)] * (9/14)

Jane to Jerryd:
= Jane to Lisa to Jerryd + Jane to Thomas to Jerryd
= 2/3 * (10/19) + [1/3 + (2/3 * 9/19)] * (9/14)

Here is somewhat of psedocode:

def get_favors(graph, label="jane", starting_favor=1):
  start = graph.findNode(label)
  queue = [(start, starting_favor)]
  favors = {}
  seen = set()
  
  while queue:
    node, curr_favor = queue.popleft()

    # get total weight (out edges) from this node
    total_favor = 0
    for (edgeW, outNode) in node.out_edges:
       total_favor = total_favor + edgeW

    for (edgeW, outNode) in node.out_edges:
    
       # if there are no favors for this node
       # take current favor and provide proportional favor
       if outNode not in favors:
          favors[outNode] = curr_favor * (edgeW / total_favor)

       # it already has some favor, so we add to it
       # we add proportional favor
       else:
          favors[outNode] += curr_favor * (edgeW / total_favor)

       # if we have seen this edge, and node ignore
       # otherwise, transverse
    
       if (edgeW, outNode) not in seen:
          seen.add((edgeW, outNode))
          queue.append((outNode, favors[outNode]))

   # sort favor by value and return top X
   return favors


Solution

  • Here is a Gremlin query that I believe applies your formula correctly. I'll paste the full final query first then say a few words about the steps involved.

    gremlin> g.withSack(1).V().
    ......1>    has('name','jane').
    ......2>    repeat(outE().
    ......3>           sack(mult).
    ......4>             by(project('w','f').
    ......5>               by('weight').
    ......6>               by(outV().outE().values('weight').sum()).
    ......7>               math('w / f')).
    ......8>           inV().
    ......9>           simplePath()).
    .....10>    until(has('name','jerryd')).
    .....11>    sack().
    .....12>    sum()     
    
    ==>0.768170426065163         
    

    The query starts with Jane and keeps traversing until all paths to Jerry D have been inspected. Along the way for each traverser a sack is maintained containing the calculated weight values for each relationship multiplied together. The calculation on line 6 finds all the edge weight values possible coming from the prior vertex and the math step on line 7 is used to divide the weight on the current edge by that sum. At the very end each of the computed results is added together on line 12. If you remove the final sum step you can see the intermediate results.

    gremlin> g.withSack(1).V().
    ......1>    has('name','jane').
    ......2>    repeat(outE().
    ......3>           sack(mult).
    ......4>             by(project('w','f').
    ......5>               by('weight').
    ......6>               by(outV().outE().values('weight').sum()).
    ......7>               math('w / f')).
    ......8>           inV().
    ......9>           simplePath()).
    .....10>    until(has('name','jerryd')).
    .....11>    sack()
    
    ==>0.2142857142857143
    ==>0.3508771929824561
    ==>0.2030075187969925   
    

    To see the routes taken a path step can be added to the query.

    gremlin> g.withSack(1).V().
    ......1>    has('name','jane').
    ......2>    repeat(outE().
    ......3>           sack(mult).
    ......4>             by(project('w','f').
    ......5>               by('weight').
    ......6>               by(outV().outE().values('weight').sum()).
    ......7>               math('w / f')).
    ......8>           inV().
    ......9>           simplePath()).
    .....10>    until(has('name','jerryd')).
    .....11>    local(
    .....12>      union(
    .....13>        path().
    .....14>          by('name').
    .....15>          by('weight'),
    .....16>        sack()).fold()) 
    
    ==>[[jane,10,thomas,90,jerryd],0.2142857142857143]
    ==>[[jane,20,lisa,100,jerryd],0.3508771929824561]
    ==>[[jane,20,lisa,90,thomas,90,jerryd],0.2030075187969925]   
    

    This approach also takes account of adding in any direct connections, per your formula as we can see if we use Thomas as the target.

    gremlin>  g.withSack(1).V().
    ......1>    has('name','jane').
    ......2>    repeat(outE().
    ......3>           sack(mult).
    ......4>             by(project('w','f').
    ......5>               by('weight').
    ......6>               by(outV().outE().values('weight').sum()).
    ......7>               math('w / f')).
    ......8>           inV().
    ......9>           simplePath()).
    .....10>    until(has('name','thomas')).
    .....11>    local(
    .....12>      union(
    .....13>        path().
    .....14>          by('name').
    .....15>          by('weight'),
    .....16>        sack()).fold())    
    
    ==>[[jane,10,thomas],0.3333333333333333]
    ==>[[jane,20,lisa,90,thomas],0.3157894736842105]  
    

    These extra steps are not needed but having the path included is useful when debugging queries like this. Also, and this is not necessary but perhaps just for general interest, I will add that you can also get to the final answer from here but the very first query I included is all you really need.

    g.withSack(1).V().
       has('name','jane').
       repeat(outE().
              sack(mult).
                by(project('w','f').
                  by('weight').
                  by(outV().outE().values('weight').sum()).
                  math('w / f')).
              inV().
              simplePath()).
       until(has('name','thomas')).
       local(
         union(
           path().
             by('name').
             by('weight'),
           sack()).fold().tail(local)).  
        sum() 
      
    ==>0.6491228070175439  
    

    If any of this is unclear or I have mis-understood the formula, please let me know.

    EDITED to add

    To find the results for all people reachable from Jane I had to modify the query a little bit. The unfold at the end is just to make the results easier to read.

    gremlin> g.withSack(1).V().
    ......1>    has('name','jane').
    ......2>    repeat(outE().
    ......3>           sack(mult).
    ......4>             by(project('w','f').
    ......5>               by('weight').
    ......6>               by(outV().outE().values('weight').sum()).
    ......7>               math('w / f')).
    ......8>           inV().
    ......9>           simplePath()).
    .....10>    emit().
    .....11>    local(
    .....12>      union(
    .....13>        path().
    .....14>          by('name').
    .....15>          by('weight').unfold(),
    .....16>        sack()).fold()).
    .....17>        group().
    .....18>          by(tail(local,2).limit(local,1)).
    .....19>          by(tail(local).sum()).
    .....20>        unfold()
    
    ==>jerryd=0.768170426065163
    ==>wyd=0.23182957393483708
    ==>lisa=0.6666666666666666
    ==>thomas=0.6491228070175439    
    

    The final group step on line 17 uses the path results to calculate the total favor for each unique name found. To see the paths you can run the query with the group step removed.

    gremlin> g.withSack(1).V().
    ......1>    has('name','jane').
    ......2>    repeat(outE().
    ......3>           sack(mult).
    ......4>             by(project('w','f').
    ......5>               by('weight').
    ......6>               by(outV().outE().values('weight').sum()).
    ......7>               math('w / f')).
    ......8>           inV().
    ......9>           simplePath()).
    .....10>    emit().
    .....11>    local(
    .....12>      union(
    .....13>        path().
    .....14>          by('name').
    .....15>          by('weight').unfold(),
    .....16>        sack()).fold())
    
    ==>[jane,10,thomas,0.3333333333333333]
    ==>[jane,20,lisa,0.6666666666666666]
    ==>[jane,10,thomas,50,wyd,0.11904761904761904]
    ==>[jane,10,thomas,90,jerryd,0.2142857142857143]
    ==>[jane,20,lisa,90,thomas,0.3157894736842105]
    ==>[jane,20,lisa,100,jerryd,0.3508771929824561]
    ==>[jane,20,lisa,90,thomas,50,wyd,0.11278195488721804]
    ==>[jane,20,lisa,90,thomas,90,jerryd,0.2030075187969925]