Tags: cassandra, gremlin, tinkerpop, datastax-enterprise-graph

TinkerPop Gremlin - How to aggregate variables into traversal-independent collections


I'm currently reading The Practitioner's Guide to Graph Data and am trying to solve the following problem (just for learning purposes). It is set in the context of the book's movie dataset, which in this example uses a "Tag" vertex, a "Movie" vertex, and a "rated" edge with a rating property whose value is between 1 and 5.

Just for practice, and to extend my understanding of concepts from the book, I would like to get all movies tagged with "comedy" and calculate the mean NPS for each. To do this, I want to collect a +1 for every positive rating and a -1 for every neutral or negative rating into a list, then divide the sum of these values by the number of values in the list (the mean). This is what I attempted:

dev.withSack{[]}{it.clone()}.    // create a sack with an empty list that clones when split
V().has('Tag', 'tag_name', 'comedy').
    in('topic_tagged').as('film').    // walk to movies tagged as comedy
    inE('rated').    // walk to the rated edges
        choose(values('rating').is(gte(3.0)),
            sack(addAll).by(constant([1.0])),
            sack(addAll).by(constant([-1.0]))).    // add a value of 1 or -1 to this movie's list, depending on the rating
    group().
        by(select('film').values('movie_title')).
        by(project('a', 'b').
            by(sack().unfold().sum()).    // add all values from the list
            by(sack().unfold().count()).    // Count the values in the list
            math('a / b')).
    order(local).
        by(values, desc)

This ends up with every movie having a value of either "1.0" or "-1.0":

"Journey of August King The (1995)": "1.0",
"Once Upon a Time... When We Were Colored (1995)": "1.0", ...

In my testing, it seems the values aren't being aggregated into the collection as I expected. I've tried various approaches, but none of them achieves the result I want.
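
A plausible reading of this result, sketched in plain Python rather than Gremlin: a sack is local to each traverser, so every traverser that reaches a rated edge works on its own clone of the initial empty list and never sees the values added by the others. The ratings and the `score` helper below are made up for illustration; the sketch only models the behaviour and is not how TinkerPop is implemented.

```python
def score(rating, threshold=3.0):
    """Hypothetical helper: +1.0 for a rating at or above the threshold, else -1.0."""
    return 1.0 if rating >= threshold else -1.0


# Made-up ratings on the 'rated' edges of a single movie.
ratings = [5.0, 4.0, 2.0, 3.0, 1.0]

# Intended result: one shared list per movie, then sum / count.
pooled = [score(r) for r in ratings]
intended_mean = sum(pooled) / len(pooled)

# Modelled actual behaviour: one traverser per edge, each with its own
# clone of the initial empty list, so every list ends up with one element.
initial_sack = []
per_traverser_means = []
for r in ratings:
    sack = list(initial_sack)      # the clone made when the traverser splits
    sack.extend([score(r)])        # sack(addAll).by(constant([...]))
    per_traverser_means.append(sum(sack) / len(sack))

print(intended_mean)
print(per_traverser_means)
```

Under this model each per-traverser mean can only ever be 1.0 or -1.0, which is consistent with the output above.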

I am aware that I can achieve this result by adding and subtracting from a sack with an initial value of "0.0", then dividing by the edge count, but I am hoping for a more efficient solution by using a list and avoiding an additional traversal to the edges to get the count.

Is it possible to achieve my result using a list? If so, how?

Edit 1:

The much simpler code below, taken from Kelvin's example, collects the scores for each movie using nothing more than the fold step:

dev.V().
    has('Tag', 'tag_name', 'comedy').
        in('topic_tagged').
        project('movie', 'result').
            by('movie_title').
            by(inE('rated').
                choose(values('rating').is(gte(3.0)),
                    constant(1.0),
                    constant(-1.0)).
                fold())    // replace fold() with mean() to calculate the mean, or do something with the collection

I feel a bit embarrassed that I completely forgot about the fold step, as folding and unfolding are so common. Overthinking, I guess.
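
As a rough illustration of what the project/fold shape computes per movie, here is a plain-Python model (not Gremlin). The movie titles, the ratings, and the `score` helper are all hypothetical.

```python
def score(rating, threshold=3.0):
    """Hypothetical helper: +1.0 for a rating at or above the threshold, else -1.0."""
    return 1.0 if rating >= threshold else -1.0


# Made-up ratings keyed by movie title (stand-in for the 'rated' edges).
ratings_by_movie = {
    "Movie A": [5.0, 4.0, 2.0],
    "Movie B": [1.0, 2.0, 3.0, 2.0],
}

# Equivalent of ending the inner traversal with fold(): one list per movie.
folded = {title: [score(r) for r in rs] for title, rs in ratings_by_movie.items()}

# Equivalent of ending it with mean() instead: one number per movie.
means = {title: sum(scores) / len(scores) for title, scores in folded.items()}

# Highest mean first, similar in spirit to ordering the results descending.
ranked = sorted(means.items(), key=lambda kv: kv[1], reverse=True)

print(folded)
print(ranked)
```

Because the scores are collected inside the per-movie branch, no state has to be shared between traversers.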


Solution

  • You might consider a different approach using aggregate rather than sack. You can also use the mean step to avoid needing the math step. As I don't have your data, I made an example that uses the air-routes data set, with the airport elevation standing in for the movie rating in your case.

    gremlin> g.V().hasLabel('airport').limit(10).values('elev')
    ==>1026
    ==>151
    ==>542
    ==>599
    ==>19
    ==>143
    ==>14
    ==>607
    ==>64
    ==>313  
    

    Using a weighting system similar to yours yields

    gremlin> g.V().hasLabel('airport').limit(10).
    ......1>   choose(values('elev').is(gt(500)),
    ......2>     constant(1),
    ......3>     constant(-1))
    ==>1
    ==>-1
    ==>1
    ==>1
    ==>-1
    ==>-1
    ==>-1
    ==>1
    ==>-1
    ==>-1    
    

    Those results can be aggregated into a bulk set

    gremlin> g.V().hasLabel('airport').limit(10).
    ......1>   choose(values('elev').is(gt(500)),
    ......2>     constant(1),
    ......3>     constant(-1)).
    ......4>   aggregate('x').
    ......5>   cap('x')
    ==>[1,1,1,1,-1,-1,-1,-1,-1,-1]  
    

    From there we can take the mean value

    gremlin> g.V().hasLabel('airport').limit(10).
    ......1>   choose(values('elev').is(gt(500)),
    ......2>     constant(1),
    ......3>     constant(-1)).
    ......4>   aggregate('x').
    ......5>   cap('x').
    ......6>   unfold().
    ......7>   mean()
    ==>-0.2    
    

    Now, this is of course contrived: you would not usually write aggregate('x').cap('x').unfold().mean(); you would just use mean() by itself. However, using this pattern you should be able to solve your problem.
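
The numbers above can be checked by hand with a short plain-Python sketch (not Gremlin). The elevations are the ten values printed earlier; the variable names are only illustrative. The collected set prints in a different order in the console output, which does not affect the mean.

```python
# The ten elevations returned by the first query above.
elevations = [1026, 151, 542, 599, 19, 143, 14, 607, 64, 313]

# choose(values('elev').is(gt(500)), constant(1), constant(-1))
weights = [1 if elev > 500 else -1 for elev in elevations]

# aggregate('x').cap('x') followed by unfold().mean():
collected = list(weights)
mean_via_collection = sum(collected) / len(collected)

# mean() applied directly to the weights:
mean_direct = sum(weights) / len(weights)

print(weights)
print(mean_via_collection, mean_direct)
```

Both routes give the same value, which is why the intermediate collection is only needed when you want to do something else with it.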

    EDITED TO ADD

    Thinking about this more, you can probably write the query without needing aggregate at all, along the lines of the query below. I used the air-routes distance edge property (dist) to simulate something similar to your query. The example uses just one airport to keep it simple. First, creating the list of scores:

    gremlin> g.V().has('airport','code','SAF').
    ......1>   project('airport','mean').
    ......2>     by('code').
    ......3>     by(outE().
    ......4>        choose(values('dist').is(gt(350)),
    ......5>          constant(1),
    ......6>          constant(-1)).
    ......7>          fold())
    ==>[airport:SAF,mean:[1,1,1,-1]]   
    

    and finally creating the mean value

    gremlin> g.V().has('airport','code','SAF').
    ......1>   project('airport','mean').
    ......2>     by('code').
    ......3>     by(outE().
    ......4>        choose(values('dist').is(gt(350)),
    ......5>          constant(1),
    ......6>          constant(-1)).
    ......7>          mean())
    ==>[airport:SAF,mean:0.5]
    

    Edited again

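    If the edge property may not exist, you can use coalesce to supply a default value. In this example no edge has a property named 'x', so every edge falls back to 100 and is scored -1: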

    gremlin> g.V().has('airport','code','SAF').
    ......1>   project('airport','mean').
    ......2>     by('code').
    ......3>     by(outE().
    ......4>        coalesce(values('x'),constant(100)).
    ......5>        choose(identity().is(gt(350)),
    ......6>          constant(1),
    ......7>          constant(-1)).
    ......8>          fold())
    ==>[airport:SAF,mean:[-1,-1,-1,-1]]
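
The default-value pattern can be modelled in plain Python (not Gremlin) as follows. The edge dictionaries, their distances, and the `value_or_default` helper are hypothetical and are not taken from the air-routes data.

```python
def value_or_default(properties, key, default):
    """Hypothetical stand-in for coalesce(values(key), constant(default))."""
    return properties[key] if key in properties else default


# Made-up outgoing edges; none of them carries a property named 'x'.
edges = [{"dist": 549}, {"dist": 369}, {"dist": 303}, {"dist": 708}]

# Property is missing on every edge, so the default of 100 is scored each time.
scores_missing = [1 if value_or_default(e, "x", 100) > 350 else -1 for e in edges]

# Property is present, so the real value is scored and the default is ignored.
scores_present = [1 if value_or_default(e, "dist", 100) > 350 else -1 for e in edges]

print(scores_missing)
print(scores_present)
```

Choosing the default decides how an edge without the property is scored, so it should sit on the intended side of the threshold.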