Graph Databases: Retrieving the most complex Relationships using Gremlin


I'm trying to write a Gremlin query that returns a list of traversed vertices and edges (with their properties) for the most complex starting vertices, i.e. those with the highest count of related vertices.

In other words, I want to retrieve the patients with the most codes, but there is no direct relationship between Patients and Codes. This is the relationship and its direction: Patient->Diagnosis<-Code

Here is my attempt:

g.V().hasLabel('Patient'). 
  outE().inV().
  inE().outV().
  path().
    by(elementMap()).
  order().
    by(count(local), asc).
  tail(2).
  unfold().
  toList()

I wanted this to return only the top 2 patient vertices, ranked by the number of codes per patient, together with their traversed edges and vertices. This is what I got:

[Image: single patient vertex with traversed edges/nodes]

Here is a sample insert to replicate the same relationships:

g
.addV('pat').property(id, 'p-0')
.addV('pat').property(id, 'p-1')
.addV('pat').property(id, 'p-2')
.addV('diag').property(id, 'd-0')
.addV('diag').property(id, 'd-1')
.addV('diag').property(id, 'd-2')
.addV('code').property(id, 'c-0')
.addV('code').property(id, 'c-1')
.V('p-0').addE('contracted').to(V('d-0'))
.V('p-0').addE('contracted').to(V('d-1'))
.V('p-0').addE('contracted').to(V('d-2'))
.V('p-1').addE('contracted').to(V('d-1'))
.V('p-2').addE('contracted').to(V('d-2'))
.V('c-0').addE('includes').to(V('d-0'))
.V('c-1').addE('includes').to(V('d-0'))
.V('c-1').addE('includes').to(V('d-1'))
.V('c-2').addE('includes').to(V('d-1'))

This is an example of the format I would like to return:

[image]

I used ".path().by(elementMap()).unfold().toList()" after the vertex and edge steps to get this.

I want the output to be the vertices and edges that will produce a graph like this:

[image]

As you can see, out of three patients, I want to return the top 2 most complex patients (based on the number of codes their diagnoses have). I don't want to return the patient with just one code.
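
As a reference for the expected result, the ranking can be sketched in plain Python (not Gremlin) over the sample data. The function name `top_patients` is hypothetical, and two things are assumptions: codes are counted as distinct per patient, and the `c-2` edge is left out because the insert never adds a `c-2` vertex.

```python
from collections import defaultdict

# Edges copied from the sample insert. The ('c-2', 'd-1') edge is omitted on
# the assumption that it is never created (no 'c-2' vertex is added).
contracted = [("p-0", "d-0"), ("p-0", "d-1"), ("p-0", "d-2"),
              ("p-1", "d-1"), ("p-2", "d-2")]
includes = [("c-0", "d-0"), ("c-1", "d-0"), ("c-1", "d-1")]


def top_patients(contracted, includes, n=2):
    """Return the n patients with the most distinct codes, highest first."""
    # Collect the codes attached to each diagnosis.
    codes_by_diag = defaultdict(set)
    for code, diag in includes:
        codes_by_diag[diag].add(code)

    # Union the codes of every diagnosis a patient has contracted.
    codes_by_patient = defaultdict(set)
    for patient, diag in contracted:
        codes_by_patient[patient] |= codes_by_diag[diag]

    # Sort by descending code count, breaking ties by patient id.
    ranked = sorted(codes_by_patient,
                    key=lambda p: (-len(codes_by_patient[p]), p))
    return [(p, len(codes_by_patient[p])) for p in ranked[:n]]


print(top_patients(contracted, includes))  # [('p-0', 2), ('p-1', 1)]
```

Under these assumptions the two patients to keep are p-0 and p-1, and p-2 (no codes) is dropped.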


Solution

  • Thanks for providing the sample graph; that really helps. The following query is useful just for viewing the graph visually.

    g.V().hasLabel('pat').
      outE().inV().
      inE().outV().
      simplePath().
      path().by(elementMap())
    

    Which, using graph-notebook, produces:

    [image]

    To find the number of codes for each starting patient, we might do this. It builds on the prior query but filters using edge labels.

    g.V().hasLabel('pat').as('p').
      out('contracted').
      group().
        by(select('p').id()).
        by(in('includes').count())
    

    which gives the number of codes associated with each patient:

    {'p-0': 3, 'p-2': 0, 'p-1': 1}
    

    However, you may not want to double count a code that is shared by more than one diagnosis. In that case we can dedup the results.

    g.V().hasLabel('pat').as('p').
      out('contracted').
      group().
        by(select('p').id()).
        by(in('includes').dedup().count())
    

    which reduces the count for p-0 to 2 and removes p-2 completely as there are no codes.

    {'p-0': 2, 'p-1': 1}
    

    UPDATED

    Based on additional discussion in the comments, the group() results can be used as a filter.

    g.V().hasLabel('pat').as('p').
      outE('contracted').inV().
      where(
        group().
          by(select('p').id()).
          by(in('includes').dedup().count()).
        select(values).unfold().is(2)).
      inE().outV().
      path().by(elementMap())
    

    When rendered visually:

    [image]
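
As a cross-check on the counts reported above, here is a minimal sketch in plain Python rather than Gremlin. The function name `code_counts` is hypothetical, and the edge from `c-2` is left out on the assumption that it is never created, since the sample insert adds no `c-2` vertex; that assumption is what reproduces the counts of 3 and 2 for p-0.

```python
from collections import Counter, defaultdict

# Edges copied from the sample insert. The ('c-2', 'd-1') edge is omitted on
# the assumption that it is never created (no 'c-2' vertex is added).
contracted = [("p-0", "d-0"), ("p-0", "d-1"), ("p-0", "d-2"),
              ("p-1", "d-1"), ("p-2", "d-2")]
includes = [("c-0", "d-0"), ("c-1", "d-0"), ("c-1", "d-1")]


def code_counts(contracted, includes):
    """Return (raw, distinct) code counts per patient."""
    codes_by_diag = defaultdict(list)
    for code, diag in includes:
        codes_by_diag[diag].append(code)

    raw = Counter()              # counts a shared code once per diagnosis
    distinct = defaultdict(set)  # counts each code once per patient
    for patient, diag in contracted:
        raw[patient] += len(codes_by_diag[diag])
        distinct[patient].update(codes_by_diag[diag])
    return dict(raw), {p: len(c) for p, c in distinct.items()}


raw, distinct = code_counts(contracted, includes)
print(raw)       # {'p-0': 3, 'p-1': 1, 'p-2': 0}
print(distinct)  # {'p-0': 2, 'p-1': 1, 'p-2': 0}
```

The raw counts match the first group() result and the distinct counts match the dedup version, except that this sketch keeps p-2 with a count of 0 instead of dropping it.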