Search code examples
gremlinamazon-neptunegremlinpython

Gremlin query to group by multiple columns (from vertex, and edge)


I have a developer vertex and a project vertex and an edge with start date and end date properties. A developer can work only on one project at a given point of time. Sometimes they are assigned to multiple projects with the same start date by user by mistake.

  1. I need to find out if a developer has more than one project assigned to them with same start date. If it is then I need to print

      001 Akash 2021-06-01  2
    
  2. If you look at 3rd developer, he was assigned to 2 projects starting 2021-07-01. But one of them is valid as it has an end date which has a corresponding record with start date of 2021-09-01. In my 2nd report I need to list

{"id":"003_P003","label":"works_in","start_date":"2021-07-01","end_date":"2021-07-05"} which is duplicate and odd one.

I tried following query for #1,but it does not show me both id, start date and count. It only shows me developer's id and count

 g.V().
  hasLabel('developer').
  outE('works_in').
  groupCount().by(outV().id()).
  groupCount().by('start_date').
  limit(2).
  unfold().
  toList()

Then I tried next two which did not work and shows error message "‘Column’ object is not callable" (I am using Python Germlin)

 g.V().
  hasLabel('developer').
  outE('works_in').by(values(outV().id(), 'start_date')).
  groupCount().
  unfold().
  toList() 

The next one below also throws error.

 g.V().
  hasLabel('developer').
  outE('works_in').
  groupCount().by(outV().id(), 'start_date').
  unfold().
  toList()

#2 I did not start yet. I am not sure if it can be done with Gremlin though.

developer (vertex)

[{"id":"001","label":"developer","name":"Akash","skill":"c#"},
{"id":"002","label":"developer","name":"John","skill":"react"}, 
{"id":"003","label":"developer","name":"Bruno","skill":"python"}]

project (vertex)

[{"id":"P001","label":"project","name":"Web App"},{"id":"P002","label":"project","name":"Smart Contract"}, {"id":"P003","label":"project","name":"Migrate to AWS"}]

works_in (edge)

    [{"id":"001_P001","label":"works_in","start_date":"2021-06-01","end_date":"2021-12-31"},
    {"id":"002_P002","label":"works_in","start_date":"2021-01-01","end_date":"2021-12-31"}, 
    {"id":"001_P002","label":"works_in","start_date":"2021-06-01","end_date":"2021-06-30"},
    
    {"id":"003_P003","label":"works_in","start_date":"2021-01-01","end_date":"2021-06-30"},
    {"id":"003_P003","label":"works_in","start_date":"2021-07-01","end_date":"2021-07-05"},
    {"id":"003_P002","label":"works_in","start_date":"2021-07-01","end_date":"2021-08-31"},
    {"id":"003_P002","label":"works_in","start_date":"2021-09-01","end_date":"2021-12-31"}

]

Any help is appreciated. Also I am looking for tutorial/course to understand Gremlin query better other than Tinkerpop's official documentation.


Solution

  • Query for the first usecase:

       gremlin> g.V().
    ......1>   hasLabel('developer').
    ......2>   local(
    ......3>     __.as('b').
    ......4>     outE().
    ......5>     project('id', 'name', 'startDate').
    ......6>       by(select('b').values('id')).
    ......7>       by(select('b').values('name')).
    ......8>       by(values('startDate')).
    ......9>     groupCount().
    .....10>     unfold().as('a').
    .....11>     select(values).
    .....12>     is(gt(1)).
    .....13>     select('a')).
    .....14>   local(
    .....15>     union(select(keys).unfold().select(values), select(values)).fold())
    ==>[003,bruno,2021-07-01,2]
    ==>[001,akash,2021-06-01,2]
    

    Query for the second usecase

    I couldn't write any straight forward query for the second usecase.