Deep graph traversals with labelled output

I'm trying to write a function that generates a Gremlin query. The input of the function is an array of arrays of strings, naming the relationships we want to return from the graph. The graph contains information on TV shows and movies. An example input would be: [[seasons, episodes, talent], [studios, movies, images]]. The strings refer to edge names.

I need to return a JSON object containing the IDs of the vertices, labelled by their edge names, but I'm finding the Gremlin query very difficult.

So far I've managed to write this query:

g.V('network_1').out().where(__.inE().
    hasLabel('seasons')).
  group().
    by(__.inE().label()).
    by(__.group().by(T.id).
        by(__.out().where(__.inE().
            hasLabel('episodes')).
          group().
            by(__.inE().label()).
            by(__.group().by(T.id).
                by(__.out().where(__.inE().
                    hasLabel('talent')).
                  group().
                    by(__.inE().label()).by(T.id))))).
  next()

Which gives this output:

{
  "seasons": {
    "season_2": {
      "episodes": {
        "episode_4": {
          "talent": [
            "talent_8",
            "talent_6",
            "talent_7"
          ]
        }
      }
    },
    "season_1": {
      "episodes": {
        "episode_2": {
          "talent": [
            "talent_2",
            "talent_3"
          ]
        },
        "episode_3": {
          "talent": [
            "talent_4",
            "talent_5"
          ]
        },
        "episode_1": {
          "talent": [
            "talent_1"
          ]
        }
      }
    }
  }
}

That output is exactly the kind of thing I'm looking for; however, the problems are:

That query seems hugely overcomplicated.
The array of edges to query could be any size. In my example it's 3, but it could be anything.
In the example there are 2 arrays of edges to query, which ideally I could combine into one query.

I'm writing this in Python, and would be hugely appreciative of any help or pointers.

Example content:

g.addV('show').property('id', 'show_1').as('show_1').
  addV('season').property('id', 'season_1').as('season_1').
  addV('season').property('id', 'season_2').as('season_2').
  addV('episode').property('id', 'episode_1').as('episode_1').
  addV('episode').property('id', 'episode_2').as('episode_2').
  addV('episode').property('id', 'episode_3').as('episode_3').
  addV('episode').property('id', 'episode_4').as('episode_4').
  addV('talent').property('id', 'talent_1').as('talent_1').
  addV('talent').property('id', 'talent_2').as('talent_2').
  addV('talent').property('id', 'talent_3').as('talent_3').
  addV('talent').property('id', 'talent_4').as('talent_4').
  addV('talent').property('id', 'talent_5').as('talent_5').
  addV('talent').property('id', 'talent_6').as('talent_6').
  addV('talent').property('id', 'talent_7').as('talent_7').
  addV('talent').property('id', 'talent_8').as('talent_8').
  addE('seasons').from('show_1').to('season_1').
  addE('seasons').from('show_1').to('season_2').
  addE('episodes').from('season_1').to('episode_1').
  addE('episodes').from('season_1').to('episode_2').
  addE('episodes').from('season_1').to('episode_3').
  addE('episodes').from('season_2').to('episode_4').
  addE('talent').from('episode_1').to('talent_1').
  addE('talent').from('episode_2').to('talent_2').
  addE('talent').from('episode_2').to('talent_3').
  addE('talent').from('episode_3').to('talent_4').
  addE('talent').from('episode_3').to('talent_5').
  addE('talent').from('episode_4').to('talent_6').
  addE('talent').from('episode_4').to('talent_7').
  addE('talent').from('episode_4').to('talent_8').iterate()

Solution

For JVM language variants of Gremlin, I think tree() would be quite helpful to you:

gremlin> g.V().out('seasons').
......1>   out('episodes').
......2>   out('talent').
......3>   tree().
......4>     by('id').next()
==>show_1={season_2={episode_4={talent_6={}, talent_8={}, talent_7={}}}, season_1={episode_2={talent_3={}, talent_2={}}, episode_3={talent_5={}, talent_4={}}, episode_1={talent_1={}}}}

To the best of my recollection, though, tree() isn't well supported off the JVM (in your case, Python). You might try it anyway.

Another option, one more tuned to Python right now, is to do some nested grouping as you have done in your example. You note it as complex, but I think it is only so because of the backtrack filtering everywhere. I'd also add that while it might appear to work, I sense that it might not work in all cases given the use of by(__.inE().label()) to group on, as that only looks at the first edge label for each vertex being grouped. It relies on the structure of the data to be successful, so it might set you up for a bug in the future if inE() suddenly returned something you didn't expect. I suppose you could limit that chance by specifying the label, as in inE('seasons').label(), but that seems a bit off.

I tend to favor Gremlin that is immediately readable as to its intent. As such, I took the following approach (it doesn't exactly match the output you provided with all the key values, but I think you will find the data matches what you want):

gremlin> g.V().out('seasons').
......1>   out('episodes').
......2>   out('talent').
......3>   path().
......4>     by('id').
......5>   group().
......6>     by(limit(local,1)).
......7>     by(tail(local,3).
......8>        group().
......9>          by(limit(local,1)).
.....10>          by(tail(local,2).
.....11>             group().
.....12>               by(limit(local,1)).
.....13>               by(tail(local).fold())))
==>[show_1:[season_2:[episode_4:[talent_6,talent_7,talent_8]],season_1:[episode_2:[talent_2,talent_3],episode_3:[talent_4,talent_5],episode_1:[talent_1]]]]

I like this approach because the navigation part is so simple and direct - out() over "seasons", out() over "episodes" and out() over "talent". There is no question as to what data is being gathered. At line 3 we gather the path and then do a nested group over it to build a tree-like structure similar to the one I'd generated with the tree() step. In fact, this one is a bit nicer in terms of output because it doesn't include empty leaves.

To pick this apart a bit further, start by considering the base output we're working with:

gremlin> g.V().out('seasons').
......1>   out('episodes').
......2>   out('talent').
......3>   path().
......4>     by('id')
==>[show_1,season_1,episode_1,talent_1]
==>[show_1,season_1,episode_2,talent_2]
==>[show_1,season_1,episode_2,talent_3]
==>[show_1,season_1,episode_3,talent_4]
==>[show_1,season_1,episode_3,talent_5]
==>[show_1,season_2,episode_4,talent_6]
==>[show_1,season_2,episode_4,talent_7]
==>[show_1,season_2,episode_4,talent_8]
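
As an aside for Python: rows like these are already enough to build the labelled structure client-side, if you'd rather keep the Gremlin down to the path() part. Below is a minimal sketch, assuming each path arrives as (or has been converted to) a plain list of ids. nest_paths is a hypothetical helper of mine, not a Gremlin or gremlinpython function, and it takes the same edge names that were used in the out() steps:

```python
import json


def nest_paths(paths, edge_labels):
    """Fold flat id paths into nested dicts keyed by vertex id and edge name.

    Hypothetical helper: assumes every path is a plain list holding exactly
    len(edge_labels) + 1 ids, in traversal order.
    """
    root = {}
    for path in paths:
        if len(path) != len(edge_labels) + 1:
            raise ValueError("path length does not match the number of edge labels")
        node = root
        # Every hop except the last creates {vertex_id: {edge_label: {...}}}.
        for vertex, label in zip(path[:-2], edge_labels[:-1]):
            node = node.setdefault(vertex, {}).setdefault(label, {})
        # The last hop collects the leaf ids in a list instead of a dict.
        leaves = node.setdefault(path[-2], {}).setdefault(edge_labels[-1], [])
        leaves.append(path[-1])
    return root


paths = [
    ["show_1", "season_1", "episode_1", "talent_1"],
    ["show_1", "season_1", "episode_2", "talent_2"],
    ["show_1", "season_1", "episode_2", "talent_3"],
    ["show_1", "season_1", "episode_3", "talent_4"],
    ["show_1", "season_1", "episode_3", "talent_5"],
    ["show_1", "season_2", "episode_4", "talent_6"],
    ["show_1", "season_2", "episode_4", "talent_7"],
    ["show_1", "season_2", "episode_4", "talent_8"],
]

result = nest_paths(paths, ["seasons", "episodes", "talent"])
print(json.dumps(result["show_1"]["seasons"]["season_2"]))
# {"episodes": {"episode_4": {"talent": ["talent_6", "talent_7", "talent_8"]}}}
```

The trade-off is that every full path travels over the wire before it is grouped, which may or may not matter for your data sizes.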

We want to group on each layer of those paths, which means doing a nested group(). Consider the first layer:

gremlin> g.V().out('seasons').
......1>   out('episodes').
......2>   out('talent').
......3>   path().
......4>     by('id').
......5>   group().
......6>     by(limit(local,1)).
......7>     by(tail(local,3).fold())
==>[show_1:[[season_1,episode_1,talent_1],[season_1,episode_2,talent_2],[season_1,episode_2,talent_3],[season_1,episode_3,talent_4],[season_1,episode_3,talent_5],[season_2,episode_4,talent_6],[season_2,episode_4,talent_7],[season_2,episode_4,talent_8]]]

The above puts all the "shows" together. Note how we've used tail(local,3) to remove "show_1" from each path object since we've already grouped on it. Next we want to group the "seasons" so:

gremlin> g.V().out('seasons').
......1>   out('episodes').
......2>   out('talent').
......3>   path().
......4>     by('id').
......5>   group().
......6>     by(limit(local,1)).
......7>     by(tail(local,3).
......8>        group().
......9>          by(limit(local,1)).
.....10>          by(tail(local,2).fold()))
==>[show_1:[season_2:[[episode_4,talent_6],[episode_4,talent_7],[episode_4,talent_8]],season_1:[[episode_1,talent_1],[episode_2,talent_2],[episode_2,talent_3],[episode_3,talent_4],[episode_3,talent_5]]]]

Here we know that "seasons" are in the first position so we take the first with limit(local,1) and as we no longer need seasons for further grouping we chop it off the path with tail(local,2). It's "2" this time instead of "3" because the path we are reducing is shortened to just season->episode->talent and now with "2" we go to just episode->talent. Hopefully that breaks down what's happening a bit further and you can adapt this query to your needs.
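
Since the number of edge names can vary, the nesting above lends itself to being generated. Here is a minimal sketch in Python that builds the query text for one list of edge names, following the same pattern: one group() per hop, keyed by limit(local,1), with the tail(local,n) count shrinking by one at each level. build_query and nested_group are hypothetical names of mine. The sketch assumes the edge names are trusted strings (they are interpolated straight into the query), that vertices carry an 'id' property as in the sample data, and that your server accepts the resulting text as a script:

```python
def nested_group(remaining):
    """Return the group() chain for a path tail with `remaining` hops left."""
    key = "by(limit(local,1))"
    if remaining == 1:
        # Innermost level: only the leaf id is left, so collect it in a list.
        return f"group().{key}.by(tail(local).fold())"
    # Drop the element just grouped on, then group the shorter tail.
    return f"group().{key}.by(tail(local,{remaining}).{nested_group(remaining - 1)})"


def build_query(edge_labels, id_key="id"):
    """Build Gremlin text that walks edge_labels in order and nests the paths."""
    if not edge_labels:
        raise ValueError("at least one edge label is required")
    for label in edge_labels:
        # Crude guard only; this is not a substitute for real input validation.
        if "'" in label:
            raise ValueError(f"unsupported character in edge label: {label!r}")
    hops = "".join(f".out('{label}')" for label in edge_labels)
    return f"g.V(){hops}.path().by('{id_key}').{nested_group(len(edge_labels))}"


query = build_query(["seasons", "episodes", "talent"])
print(query)
```

For the [[seasons, episodes, talent], [studios, movies, images]] input you would call build_query once per inner list and submit each string separately; combining both lists into a single traversal is not covered by this sketch.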