gremlin tinkerpop tinkerpop3 gremlin-server

Gremlin: How can I merge groups of vertices when they are similar

My query returns groups of users vertices like this:

[
    [Pedro, Sabrina, Macka, Fer]
    [Pedro, Sabrina, Macka, Fer, Britney]
    [Brintey, Fred, Christina] 
]

The first 2 groups are similar, contains mostly the same vertices. I need to merge them. I need to merge the groups that are like for example 80% similar (80% of the elements are the same).

Is this possible in gremlin? how can I do this?

Edit: https://gremlify.com/2ykos4047g5

This gremlify project creates a fake output similar to what I have in my query, I need the first 2 lists merged into a sigle one because they contain almost the same vertices and not the third one because it's completely different from the others.

So what I'm asking is how you write a query that compares all lists checking how many vertices are the same in these lists and based on that decide if merging them into a single one or not.

The expected output for the gremlify project is:

[
  [
    "Pedro",
    "Sabrina",
    "Macka",
    "Fer",
    "Britney"
  ],
  [
    "Garry",
    "Dana",
    "Lily"
  ]
]

Solution

Gremlin doesn't have steps that merge lists based on how much they are alike. Gremlin is fairly flexible so I imagine there might be ways to use its steps in creative ways to get what you want, but the added complexity may not be worth it. My personal preference is to use Gremlin to retrieve my data, filter away the whatever is extraneous, and then transform it as close as possible to its final result while maintaining a balance with readability.

Given that thinking, if your result from Gremlin is simply a list of lists of strings and your Gremlin up to that point is well structured and performant, then perhaps Gremlin has gotten you far enough and his job is done. Take that result and post-process it on your application side by writing some code to take you to your final result. With that approach you have your full programming language environment at your disposal with all the libraries available to you to make that final step easier.

I'd further add that your example is a bit contrived and focuses on an arbitrary result which reduces your Gremlin question to collection manipulation question. With graphs and Gremlin I often find that heavy focus on collection manipulation to improve the quality of a result (rather than just format of a result) implies that I should go back to my core of my traversal algorithm rather than try to tack on extra manipulation at the end of the traversal.

For example, if this output you're asking about in this question relates back to your previous questions here and here, then I'd wonder if you shouldn't rethink the rules of your algorithm. Perhaps, you really aren't "detecting triangles and then trying to group them accordingly" as I put it in one of my answers there. Maybe there is a completely different algorithm that will solve your problem which is even more effective and performant.

This blog post, "Reducing Computational Complexity with Correlate Traversals", does an excellent job in explaining this general notion. Though it focuses on centrality algorithms, the general message is quite clear:

All centrality measures share a similar conceptual theme — they all score the vertices in the graph according to how “central” they are relative to all other vertices. It is this unifying concept which can lead different algorithms to yield the same or similar results. Strong, positive correlations can be taken advantage of by the graph system architect, enabling them to choose a computationally less complex metric when possible.

In your case, perhaps you need more flexibility in the rules you stated for your algorithm to thus allow a better (i.e. less rigid) grouping in your results. In any case, it is something to think about and in the worst case you can obviously just take the brute force approach that you describe in your question and get your result.