Tags: gremlin, tinkerpop, tinkerpop3, gremlin-server, azure-cosmosdb-gremlinapi

Order results by number of coincidences in edge properties


I'm working on a recommendation system that recommends other users. The first results should be the users most "similar" to the "searcher" user. Users answer questions, and the number of questions two users answered in the same way is the measure of their similarity.

The problem is that I don't know how to write the query.

In technical terms, I need to sort the users by the number of edges that have specific property values. I tried the following query, which I thought should work, but it doesn't:

   let query = g.V().hasLabel('user');

   let search = __;
   for (const question of searcher.questions) {
      search = search.outE('response')
            .has('questionId', question.questionId)
            .has('answerId', question.answerId)
            .aggregate('x')
            .cap('x')     
   }

   query = query.order().by(search.unfold().count(), order.asc);

It throws this internal Gremlin error:

org.apache.tinkerpop.gremlin.process.traversal.step.util.BulkSet cannot be cast to org.apache.tinkerpop.gremlin.structure.Vertex

I also tried using a separate .by() for each question, but the results were not ordered by the number of coincidences.

How can I write this query?


Solution

  • When you cap() an aggregate(), it returns a BulkSet, which is a Set that keeps a count of how many times each object occurs in it. When you iterate through it, it behaves like a List, unrolling each object as many times as its count. You get your error because the output of cap('x') is a BulkSet, and since you build search in a loop, the next iteration effectively calls outE('response') on that BulkSet. That isn't valid, because outE() expects a Vertex, as the error indicates.
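
    To make the BulkSet behavior concrete, here is a toy model in plain JavaScript. It is not TinkerPop's class; ToyBulkSet and the edge names are made up for illustration. It only shows that the result is a counted collection that unrolls on iteration, which is why it cannot stand in for a Vertex.

    ```javascript
    // Toy stand-in for a BulkSet (hypothetical class, not the TinkerPop one):
    // each distinct object carries a count, and iteration yields the object
    // once per count.
    class ToyBulkSet {
      constructor() {
        this.counts = new Map();
      }

      add(item, bulk = 1) {
        this.counts.set(item, (this.counts.get(item) || 0) + bulk);
      }

      *[Symbol.iterator]() {
        for (const [item, bulk] of this.counts) {
          for (let i = 0; i < bulk; i++) yield item;
        }
      }
    }

    const bulkSet = new ToyBulkSet();
    bulkSet.add('edge-1');
    bulkSet.add('edge-2', 2);

    // Iterating unrolls each object by its count.
    const unrolled = [...bulkSet];
    console.log(unrolled); // [ 'edge-1', 'edge-2', 'edge-2' ]
    ```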

    I think you would prefer something more like:

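    // __ is gremlin.process.statics; order, scope and column come from gremlin.process
    let query = g.V().hasLabel('user').
                  outE('response');
    
    // One anonymous has() filter per question the searcher answered
    let search = [];
    for (const question of searcher.questions) {
      search.push(__.has('questionId', question.questionId).
                     has('answerId', question.answerId));
    }
    
    // Keep edges matching any of the filters, then count them per user
    query = query.or(...search).
                  groupCount().
                    by(__.outV()).
                  order(scope.local).by(column.values, order.asc);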
    

    I may not have the JavaScript syntax exactly right (I used spread syntax in or() just to convey the idea quickly), but the idea is to filter the edges that match your question criteria and then use groupCount() to count those edges per user.
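
    If it helps to see what that filter-then-groupCount() idea computes, here is a rough plain JavaScript model over an in-memory edge list. It does not use Gremlin at all; the sample edges, the user ids and the countMatchesByUser helper are all made up for illustration.

    ```javascript
    // Hypothetical sample data: each 'response' edge records who answered what.
    const responses = [
      { user: 'u1', questionId: 'q1', answerId: 'a1' },
      { user: 'u1', questionId: 'q2', answerId: 'a3' },
      { user: 'u2', questionId: 'q1', answerId: 'a1' },
      { user: 'u2', questionId: 'q2', answerId: 'a2' },
      { user: 'u3', questionId: 'q1', answerId: 'a2' },
    ];

    // The searcher's own answers (same shape as searcher.questions).
    const searcherQuestions = [
      { questionId: 'q1', answerId: 'a1' },
      { questionId: 'q2', answerId: 'a3' },
    ];

    function countMatchesByUser(edges, questions) {
      const counts = new Map();
      for (const edge of edges) {
        // Plays the role of or(has(...).has(...), ...): keep the edge if it
        // matches any question/answer pair of the searcher.
        const matches = questions.some(
          (q) => q.questionId === edge.questionId && q.answerId === edge.answerId
        );
        // Plays the role of groupCount().by(outV()): count edges per user.
        if (matches) counts.set(edge.user, (counts.get(edge.user) || 0) + 1);
      }
      // Plays the role of order(local).by(values, asc).
      return [...counts.entries()].sort((a, b) => a[1] - b[1]);
    }

    const ranked = countMatchesByUser(responses, searcherQuestions);
    console.log(ranked); // [ [ 'u2', 1 ], [ 'u1', 2 ] ]
    ```

    Note that u3 is missing from the result: a user with no matching edge never reaches the counting step.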

    If you also need to include users who have no matching edges, you could perhaps switch to project(), something like:

    let query = g.V().hasLabel('user').
                  project('user','count').
                    by();
    
    let search = [];
    for (const question of searcher.questions) {
      search.push(__.has('questionId', question.questionId).
                     has('answerId', question.answerId));
    }
    
    // count() yields 0 for users with no matching edges, so they stay in the result
    query = query.by(__.outE('response').or(...search).count()).
                  order().by(__.select('count'), order.asc);
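
    As a rough plain JavaScript picture of what the project() variant returns (again not Gremlin; the sample data and the projectMatchCounts helper are made up), every user gets a row, including those with a count of zero:

    ```javascript
    // Hypothetical sample data.
    const users = ['u1', 'u2', 'u3'];
    const responses = [
      { user: 'u1', questionId: 'q1', answerId: 'a1' },
      { user: 'u1', questionId: 'q2', answerId: 'a3' },
      { user: 'u2', questionId: 'q1', answerId: 'a1' },
      { user: 'u3', questionId: 'q1', answerId: 'a2' },
    ];
    const searcherQuestions = [
      { questionId: 'q1', answerId: 'a1' },
      { questionId: 'q2', answerId: 'a3' },
    ];

    function projectMatchCounts(allUsers, edges, questions) {
      return allUsers
        .map((user) => ({
          user,
          // Count this user's edges that match any of the searcher's answers.
          count: edges.filter(
            (e) =>
              e.user === user &&
              questions.some(
                (q) => q.questionId === e.questionId && q.answerId === e.answerId
              )
          ).length,
        }))
        // Ascending by count, like the order() step in the traversal.
        .sort((a, b) => a.count - b.count);
    }

    const rows = projectMatchCounts(users, responses, searcherQuestions);
    console.log(rows.map((r) => `${r.user}:${r.count}`)); // [ 'u3:0', 'u2:1', 'u1:2' ]
    ```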
    

    fwiw, I think you might consider a different schema for your data that would make this recommendation algorithm a bit more graph-like. One idea is to make each question/answer combination a vertex (with a "qa" label, perhaps) and to add edges from the user vertex to the "qa" vertex. Users then link directly to the question/answers they gave, and the edges show directly which users gave the same question/answer combination. That change lets the query flow much more naturally when asking the question, "Which users answered questions in the same way user 'A' did?"

    g.V().has('person','name','A').
      out('responds').
      in('responds').
      groupCount().
      order(local).by(values)
    

    With that change, we can get rid of all those has() filters, because they are implied by the "responds" edges, which encode them in the graph data itself.
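
    To picture the out('responds') / in('responds') walk, here is a small plain JavaScript model of the "qa" schema. It is only a sketch: the data, the 'q1:a1' style ids and the similarUsers helper are made up, and it assumes a user has at most one edge to any given "qa" vertex.

    ```javascript
    // Hypothetical data: each user points at the "qa" vertices
    // (question/answer combinations) they chose.
    const responds = new Map([
      ['A', ['q1:a1', 'q2:a3', 'q3:a1']],
      ['B', ['q1:a1', 'q2:a3']],
      ['C', ['q1:a1']],
      ['D', ['q1:a2']],
    ]);

    function similarUsers(graph, searcher) {
      // out('responds'): the qa vertices the searcher is linked to.
      const chosen = new Set(graph.get(searcher) || []);

      // in('responds') + groupCount(): every user reached back from those
      // qa vertices, counted once per shared qa vertex.
      const counts = new Map();
      for (const [user, qas] of graph) {
        for (const qa of qas) {
          if (chosen.has(qa)) counts.set(user, (counts.get(user) || 0) + 1);
        }
      }

      // order(local).by(values): ascending by count.
      return [...counts.entries()].sort((a, b) => a[1] - b[1]);
    }

    const similar = similarUsers(responds, 'A');
    console.log(similar); // [ [ 'C', 1 ], [ 'B', 2 ], [ 'A', 3 ] ]
    ```

    The searcher 'A' shows up in the result too, since walking back along the edges also returns to 'A'; D, who shares no answer, does not appear.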