Tags: gremlin, graph-databases, amazon-neptune

Gremlin query optimization to remove redundant data from "path" step


I have person vertices and book vertices connected by owns edges (i.e. person => owns => book). One person can own multiple books.

Let's say I have the following vertices and edges, which indicate that Tom owns 2 books and Jerry owns 1 book:

{label=person, id=person_1, name=Tom, age=30}
{label=person, id=person_2, name=Jerry, age=40}
{label=book, id=book_1, name=Book1}
{label=book, id=book_2, name=Book2}
{label=book, id=book_3, name=Book3}

person_1 => owns => book_1
person_1 => owns => book_2
person_2 => owns => book_3

I can find out which books are owned by whom with the following Gremlin query (in Java):

g.V("person_1", "person_2").outE("owns").inV().path().by(__.valueMap().with(WithOptions.tokens)).toStream().forEach(path -> {
    int size = path.size();
    for (int counter = 0; counter < size; counter++) {
        Map<Object, Object> object = path.get(counter);
        System.out.println(counter + ": " + object);
    }
});

The output is:

0: {id=person_1, label=person, name=[Tom], age=[30]}
1: {id=123, label=owns}
2: {id=book_1, label=book, name=[Book1]}
0: {id=person_1, label=person, name=[Tom], age=[30]}   <--------- as expected, identical to the first row
1: {id=456, label=owns}
2: {id=book_2, label=book, name=[Book2]}
0: {id=person_2, label=person, name=[Jerry], age=[40]}
1: {id=789, label=owns}
2: {id=book_3, label=book, name=[Book3]}

The outgoing vertex (the person vertex) is always the same for books owned by the same person. Does this add overhead for Neptune in retrieving redundant data, or extra cost for serialization? Assume I have 10 persons and each person owns 100 books. I don't think dedup would help here.

How can I optimize the query?


Solution

  • If the path itself is not particularly useful for your use case, you might consider grouping by person, so that the person is the key and the books become the values.

    For example:

    g.V("person_1", "person_2").
      group().
        by().
        by(out().fold())
    

    This will return a map that looks something like this:

    {v[person_1]:[v[book_1],v[book_2]],
     v[person_2]:[v[book_3]]}
    

    If the books have a lot of properties, avoiding valueMap will reduce the size of the data to be serialized. If you do need some properties, you can pick them selectively. For example:

    g.V("person_1", "person_2").
      group().
        by().
        by(out().valueMap('name').fold())
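
To put rough numbers on the 10 persons and 100 books per person case from the question, here is a back-of-the-envelope sketch in plain Java. It only counts how many elements each result shape contains. It does not measure Neptune's retrieval or serialization cost, and the class and method names are made up for illustration.

```java
public class ResultSizeEstimate {

    // path().by(valueMap()) shape: one path per owns edge, and every path
    // carries three maps (person, edge, book).
    static long mapsInPathResult(long persons, long booksPerPerson) {
        long paths = persons * booksPerPerson;
        return paths * 3;
    }

    // Person maps in the path() result that only repeat one already returned:
    // every path has a person map, but there are only `persons` distinct ones.
    static long redundantPersonMaps(long persons, long booksPerPerson) {
        return persons * booksPerPerson - persons;
    }

    // group().by().by(out().fold()) shape: one key per person plus one value
    // per owned book, with no edge entries.
    static long elementsInGroupResult(long persons, long booksPerPerson) {
        return persons + persons * booksPerPerson;
    }

    public static void main(String[] args) {
        long persons = 10;
        long booksPerPerson = 100;
        System.out.println("path():  " + mapsInPathResult(persons, booksPerPerson)
                + " maps, of which " + redundantPersonMaps(persons, booksPerPerson)
                + " are repeated person maps");
        System.out.println("group(): " + elementsInGroupResult(persons, booksPerPerson)
                + " elements");
    }
}
```

Under these assumptions the path() result holds 3000 maps, 990 of them repeated person maps, while the grouped result holds 1010 elements. How much that matters in practice depends on how many properties each person carries, so it is worth profiling both queries against real data.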