I'm using Amazon Neptune to create and query a simple graph database. I'm currently running my code in an AWS Jupyter Notebook but will eventually move the code to Python (gremlin_python). As you can probably guess I'm pretty new to Gremlin and graph databases in general.
I have the following data
g.addV('person').property(id, 'john')
.addV('person').property(id, 'jim')
.addV('person').property(id, 'pam')
.addV('game').property(id, 'G1')
.addV('game').property(id, 'G2')
.addV('game').property(id, 'G3').iterate()
g.V('john').as('p').V('G1').addE('bought').from('p').iterate()
g.V('john').as('p').V('G2').addE('bought').from('p').iterate()
g.V('john').as('p').V('G3').addE('bought').from('p').iterate()
g.V('jim').as('p').V('G1').addE('bought').from('p').iterate()
g.V('jim').as('p').V('G2').addE('bought').from('p').iterate()
g.V('pam').as('p').V('G1').addE('bought').from('p').iterate()
3 persons and 3 games in the database. My goal is, given a person, tell me which persons have bought the same games as them and which games are those
After looking at sample code (mostly from https://tinkerpop.apache.org/docs/current/recipes/#recommendation) I have the following code that tries to find games bought by
g.V('john').as('target') Target person we are interested in comparing against
.out('bought').aggregate('target_games') // Games bought by target
.in('bought').where(P.neq('target')).dedup() // Persons who bought same games as target (excluding target and without duplicates)
.group().by().by(out("bought").where(P.within("target_games")).count()) // Find persons, group by number of co owned games
.unfold().order().by(values, desc).toList() // Unfold to create list, order by greatest number of common games
Which gives me the results:
Which tells me that jim has 2 of the same games as john while pam only has 1. But I want my query to return the actual games they have in common like so (still in order of most common games):
Thanks for your help.
There are a few different ways this query could be written. Here is one way that uses a mid traversal V
step having found John's games to find all the other people who are not John, look at their games and see if they intersect with games that John owns.
gremlin> g.V('john').as('j').
......1> out().
......2> aggregate('owns').
......3> V().
......4> hasLabel('person').
......5> where(neq('j')).
......6> group().
......7> by(id).
......8> by(out('bought').where(within('owns')).dedup().fold())
==>[pam:[v[G1]],jim:[v[G1],v[G2]]]
However, the mid traversal V
approach is not really needed as you can just look at the incoming vertices from the games that Jown owns
gremlin> g.V('john').as('j').
......1> out().
......2> aggregate('owns').
......3> in('bought').
......4> where(neq('j')).
......5> group().
......6> by(id).
......7> by(out('bought').where(within('owns')).dedup().fold())
==>[pam:[v[G1]],jim:[v[G1],v[G2]]]
Finally, here is a third way, where the dedup
step is applied sooner. This is likely to be the most efficient of the three.
gremlin> g.V('john').as('j').
......1> out().
......2> aggregate('owns').
......3> in('bought').
......4> where(neq('j')).
......5> dedup().
......6> group().
......7> by(id).
......8> by(out('bought').where(within('owns')).fold())
==>[pam:[v[G1]],jim:[v[G1],v[G2]]]
UPDATED based on comments discussion. I'm not sure that this is a simpler query but you can extract a group
from a projection like this:
gremlin> g.V('john').as('j').
......1> out().as('johnGames').
......2> in('bought').
......3> where(neq('j')).as('personPurchasedJohnGames').
......4> project('johnGames','personPurchasedJohnGames').
......5> by(select('johnGames')).
......6> by(select('personPurchasedJohnGames')).
......7> group().
......8> by(select('personPurchasedJohnGames')).
......9> by(select('johnGames').fold())
==>[v[pam]:[v[G1]],v[jim]:[v[G1],v[G2]]]
but actually you can further reduce this to
gremlin> g.V('john').as('j').
......1> out().as('johnGames').
......2> in('bought').
......3> where(neq('j')).as('personPurchasedJohnGames').
......4> group().
......5> by(select('personPurchasedJohnGames')).
......6> by(select('johnGames').fold())
==>[v[pam]:[v[G1]],v[jim]:[v[G1],v[G2]]]
So now we have many choices to pick from! It will be interesting to measure these and see if any are faster than others. In general I have a tendency to avoid use of as
steps as that causes path tracking to be turned on (using up memory) but as we already have an as('j')
in the other queries not really a big deal.
EDITED AGAIN to add ordering of results
g.V('john').as('j').
out().as('johnGames').
in('bought').
where(neq('j')).as('personPurchasedJohnGames').
group().
by(select('personPurchasedJohnGames')).
by(select('johnGames').fold()).
unfold().
order().
by(select(values).count(local),desc)
{v[jim]: [v[G1], v[G2]]}
{v[pam]: [v[G1]]}