performance graph-databases

Best solution for filtering


I'm new to graph databases and I'm trying to learn how to build a good data model.

I have to manage 10 million "Contacts" and I would like to filter them by "gender". I built a POC and it works fine, but I can't work out whether the better solution is to store the gender as a vertex:

gender as vertex

or as a field on the contacts vertex:

gender as field

I know that each edge adds to the data size, but I can't find any reference on the performance difference between these two ways of modeling the data.

Do you know the right approach?


Solution

  • In this use case, I would store gender as a property on the contact vertex and add an index on that property to answer your filter query. While modeling gender as a separate vertex is more correct from a theoretical perspective, it has a few practical issues that lead me to suggest the second approach.

    1. The first model you suggest will introduce a supernode into your graph. A supernode is a node with a disproportionately high number of incident edges. Gender has very low selectivity (Male/Female/Unknown), so each gender vertex will have a branching factor in the millions. A branching factor of that size will likely cause all sorts of performance problems and result in slow queries. Denormalizing the gender onto the contact vertex and adding an index should resolve most of these issues. The only issue likely to remain is the time it takes to return the 3-5 million records that such a filter will probably match.
    2. In the first approach, answering the question "What is a person's gender?" would require traversing from the contact vertex along the edge to the gender vertex, which is slower than just reading the contact vertex. If this is a query you expect to run frequently, then you should take that into account.
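To make the two points above concrete, here is a minimal in-memory sketch in plain Python. It is a toy illustration rather than a real graph database: the sample contacts, the dictionaries standing in for edges and for the property index, and all function names are hypothetical. It only shows how the two models differ in shape, not how any particular engine performs.

```python
from collections import defaultdict

# Hypothetical sample data; the real graph would hold about 10 million contacts.
contacts = {
    "c1": "Female",
    "c2": "Male",
    "c3": "Female",
    "c4": "Unknown",
    "c5": "Male",
    "c6": "Female",
}

# Model 1: gender as a vertex. Every contact has one edge to a gender vertex.
out_edges = dict(contacts)          # contact -> gender vertex
in_edges = defaultdict(list)        # gender vertex -> incident contacts
for cid, gender in out_edges.items():
    in_edges[gender].append(cid)

# Model 2: gender as a property, plus an index on that property.
properties = {cid: {"gender": gender} for cid, gender in contacts.items()}
gender_index = defaultdict(set)     # property value -> contact ids
for cid, props in properties.items():
    gender_index[props["gender"]].add(cid)


def gender_via_vertex(cid):
    """Model 1: follow the edge from the contact to its gender vertex (one hop)."""
    return out_edges[cid]


def gender_via_property(cid):
    """Model 2: read the property straight off the contact vertex (no hop)."""
    return properties[cid]["gender"]


def filter_via_vertex(gender):
    """Model 1: start at the gender vertex and walk every incident edge."""
    return sorted(in_edges[gender])


def filter_via_index(gender):
    """Model 2: look the value up in the property index."""
    return sorted(gender_index[gender])


# Degree of each gender vertex. With only three possible values, each vertex
# absorbs a large share of all contacts, which is what makes it a supernode.
degrees = {gender: len(cids) for gender, cids in in_edges.items()}
print(degrees)
print(filter_via_vertex("Female") == filter_via_index("Female"))
```

Both models return the same contacts for a filter, but in the first one the number of edges on each gender vertex grows with the total number of contacts, and every gender lookup costs an extra hop. In a real graph database the index would be created and maintained by the engine itself rather than by hand as in this sketch.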