I'm having difficult time figuring out query in gremlin for the following scenario. Here is the the directed graph (may be cyclic).
I want to get top N favorable nodes, starting from node "Jane", where favor is defined as:
favor(Jane->Lisa) = edge(Jane,Lisa) / total weight from outwards edges of Lisa
favor(Jane->Thomas) = favor(Jane->Thomas) + favor(Jane->Lisa) * favor(Lisa->Thomas)
favor(Jane->Jerryd) = favor(Jane->Thomas) * favor(Thomas->Jerryd) + favor(Jane->Lisa) * favor(Lisa->Jerryd)
favor(Jane->Jerryd) = [favor(Jane->Thomas) + favor(Jane->Lisa) * favor(Lisa->Thomas)] * favor(Thomas->Jerryd) + favor(Jane->Lisa) * favor(Lisa->Jerryd)
and so .. on
Here is same graph with hand calculation of what I mean,
This is fairly simple to transferse with programming but I'm not sure, how ecactly to query it with gremlin or even sparql.
Here is the query to create this example graph:
g
.addV('person').as('1').property(single, 'name', 'jane')
.addV('person').as('2').property(single, 'name', 'thomas')
.addV('person').as('3').property(single, 'name', 'lisa')
.addV('person').as('4').property(single, 'name', 'wyd')
.addV('person').as('5').property(single, 'name', 'jerryd')
.addE('favor').from('1').to('2').property('weight', 10)
.addE('favor').from('1').to('3').property('weight', 20)
.addE('favor').from('3').to('2').property('weight', 90)
.addE('favor').from('2').to('4').property('weight', 50)
.addE('favor').from('2').to('5').property('weight', 90)
.addE('favor').from('3').to('5').property('weight', 100)
All I'm looking for is:
[Lisa, computedFavor]
[Thomas, computedFavor]
[Jerryd, computedFavor]
[Wyd, computedFavor]
I'm struggling to incooperate cyclic graph to adjust weight. This is where I've been able to query so far: https://gremlify.com/f2r0zy03oxc/2
g.V().has('name','jane'). // our starting node
repeat(
union(
outE() // get only outwards edges
).
otherV().simplePath()). // produce simple path
emit().
times(10). // max depth of 10
path(). // attain path
by(valueMap())
Addressing Comments from stephen mallette:
favor(Jane->Jerryd) =
favor(Jane->Thomas) * favor(Thomas->Jerryd)
+ favor(Jane->Lisa) * favor(Lisa->Jerryd)
// note we can expand on favor(Jane->Thomas) in above expression
//
// favor(Jane->Thomas) is favor(Jane->Thomas)@directEdge +
// favor(Jane->Lisa) * favor(Lisa->Thomas)
//
Calculation Example
Jane to Lisa => 20/(10+20) => 2/3
Lisa to Jerryd => 100/(100+90) => 10/19
Jane to Lisa to Jerryd => 2/3*(10/19)
Jane to Thomas (directly) => 10/(10+20) => 1/3
Jane to Lisa to Thomas => 2/3 * 90/(100+90) => 2/3 * 9/19
Jane to Thomas => 1/3 + (2/3 * 9/19)
Thomas to Jerryd => 90/(90+50) => 9/14
Jane to Thomas to Jerryd => [1/3 + (2/3 * 9/19)] * (9/14)
Jane to Jerryd:
= Jane to Lisa to Jerryd + Jane to Thomas to Jerryd
= 2/3 * (10/19) + [1/3 + (2/3 * 9/19)] * (9/14)
Here is somewhat of psedocode:
def get_favors(graph, label="jane", starting_favor=1):
start = graph.findNode(label)
queue = [(start, starting_favor)]
favors = {}
seen = set()
while queue:
node, curr_favor = queue.popleft()
# get total weight (out edges) from this node
total_favor = 0
for (edgeW, outNode) in node.out_edges:
total_favor = total_favor + edgeW
for (edgeW, outNode) in node.out_edges:
# if there are no favors for this node
# take current favor and provide proportional favor
if outNode not in favors:
favors[outNode] = curr_favor * (edgeW / total_favor)
# it already has some favor, so we add to it
# we add proportional favor
else:
favors[outNode] += curr_favor * (edgeW / total_favor)
# if we have seen this edge, and node ignore
# otherwise, transverse
if (edgeW, outNode) not in seen:
seen.add((edgeW, outNode))
queue.append((outNode, favors[outNode]))
# sort favor by value and return top X
return favors
Here is a Gremlin query that I believe applies your formula correctly. I'll paste the full final query first then say a few words about the steps involved.
gremlin> g.withSack(1).V().
......1> has('name','jane').
......2> repeat(outE().
......3> sack(mult).
......4> by(project('w','f').
......5> by('weight').
......6> by(outV().outE().values('weight').sum()).
......7> math('w / f')).
......8> inV().
......9> simplePath()).
.....10> until(has('name','jerryd')).
.....11> sack().
.....12> sum()
==>0.768170426065163
The query starts with Jane and keeps traversing until all paths to Jerry D have been inspected. Along the way for each traverser a sack
is maintained containing the calculated weight values for each relationship multiplied together. The calculation on line 6 finds all the edge weight values possible coming from the prior vertex and the math
step on line 7 is used to divide the weight on the current edge by that sum. At the very end each of the computed results is added together on line 12. If you remove the final sum
step you can see the intermediate results.
gremlin> g.withSack(1).V().
......1> has('name','jane').
......2> repeat(outE().
......3> sack(mult).
......4> by(project('w','f').
......5> by('weight').
......6> by(outV().outE().values('weight').sum()).
......7> math('w / f')).
......8> inV().
......9> simplePath()).
.....10> until(has('name','jerryd')).
.....11> sack()
==>0.2142857142857143
==>0.3508771929824561
==>0.2030075187969925
To see the routes taken a path
step can be added to the query.
gremlin> g.withSack(1).V().
......1> has('name','jane').
......2> repeat(outE().
......3> sack(mult).
......4> by(project('w','f').
......5> by('weight').
......6> by(outV().outE().values('weight').sum()).
......7> math('w / f')).
......8> inV().
......9> simplePath()).
.....10> until(has('name','jerryd')).
.....11> local(
.....12> union(
.....13> path().
.....14> by('name').
.....15> by('weight'),
.....16> sack()).fold())
==>[[jane,10,thomas,90,jerryd],0.2142857142857143]
==>[[jane,20,lisa,100,jerryd],0.3508771929824561]
==>[[jane,20,lisa,90,thomas,90,jerryd],0.2030075187969925]
This approach also takes account of adding in any direct connections, per your formula as we can see if we use Thomas as the target.
gremlin> g.withSack(1).V().
......1> has('name','jane').
......2> repeat(outE().
......3> sack(mult).
......4> by(project('w','f').
......5> by('weight').
......6> by(outV().outE().values('weight').sum()).
......7> math('w / f')).
......8> inV().
......9> simplePath()).
.....10> until(has('name','thomas')).
.....11> local(
.....12> union(
.....13> path().
.....14> by('name').
.....15> by('weight'),
.....16> sack()).fold())
==>[[jane,10,thomas],0.3333333333333333]
==>[[jane,20,lisa,90,thomas],0.3157894736842105]
These extra steps are not needed but having the path
included is useful when debugging queries like this. Also, and this is not necessary but perhaps just for general interest, I will add that you can also get to the final answer from here but the very first query I included is all you really need.
g.withSack(1).V().
has('name','jane').
repeat(outE().
sack(mult).
by(project('w','f').
by('weight').
by(outV().outE().values('weight').sum()).
math('w / f')).
inV().
simplePath()).
until(has('name','thomas')).
local(
union(
path().
by('name').
by('weight'),
sack()).fold().tail(local)).
sum()
==>0.6491228070175439
If any of this is unclear or I have mis-understood the formula, please let me know.
EDITED to add
To find the results for all people reachable from Jane I had to modify the query a little bit. The unfold
at the end is just to make the results easier to read.
gremlin> g.withSack(1).V().
......1> has('name','jane').
......2> repeat(outE().
......3> sack(mult).
......4> by(project('w','f').
......5> by('weight').
......6> by(outV().outE().values('weight').sum()).
......7> math('w / f')).
......8> inV().
......9> simplePath()).
.....10> emit().
.....11> local(
.....12> union(
.....13> path().
.....14> by('name').
.....15> by('weight').unfold(),
.....16> sack()).fold()).
.....17> group().
.....18> by(tail(local,2).limit(local,1)).
.....19> by(tail(local).sum()).
.....20> unfold()
==>jerryd=0.768170426065163
==>wyd=0.23182957393483708
==>lisa=0.6666666666666666
==>thomas=0.6491228070175439
The final group
step on line 17 uses the path
results to calculate the total favor for each unique name found. To see the paths you can run the query with the group
step removed.
gremlin> g.withSack(1).V().
......1> has('name','jane').
......2> repeat(outE().
......3> sack(mult).
......4> by(project('w','f').
......5> by('weight').
......6> by(outV().outE().values('weight').sum()).
......7> math('w / f')).
......8> inV().
......9> simplePath()).
.....10> emit().
.....11> local(
.....12> union(
.....13> path().
.....14> by('name').
.....15> by('weight').unfold(),
.....16> sack()).fold())
==>[jane,10,thomas,0.3333333333333333]
==>[jane,20,lisa,0.6666666666666666]
==>[jane,10,thomas,50,wyd,0.11904761904761904]
==>[jane,10,thomas,90,jerryd,0.2142857142857143]
==>[jane,20,lisa,90,thomas,0.3157894736842105]
==>[jane,20,lisa,100,jerryd,0.3508771929824561]
==>[jane,20,lisa,90,thomas,50,wyd,0.11278195488721804]
==>[jane,20,lisa,90,thomas,90,jerryd,0.2030075187969925]