Search code examples
neo4jcypher

Optimal Neo4j index strategy for multi-labeled nodes


I'm using Neo4j to represent our data warehouse. We have ~ 100,000 nodes of various types (~ 10) some of which have multiple labels. A subset of the typical node types are:

(:User)
(:Tableau:Workbook)
(:Tableau:Dashboard)

Here Tableau represents the data visualization software and Workbook and Dashboard are different Tableau entities. The reason we went with multiple labels rather than a single uniquely defined label was one may want to match all (:Tableau) nodes or all (:Dashboard) nodes (we have multiple dashboard sources).

I'm also using the GraphAware Neo4j UUID library (https://github.com/graphaware/neo4j-uuid) to ensure that each node (regardless of type) is uniquely identified via the uuid node property.

I've created an index (and uniqueness constraint) for every node label to improve performance, i.e.

CREATE INDEX ON:User(uuid)
CREATE INDEX ON:Tableau(uuid)
CREATE INDEX ON:Workbook(uuid)
CREATE INDEX ON:Dashboard(uuid)

given that CREATE INDEX must take exactly one label.

I've run into a few performance issues when matching nodes using Cypher given this indexing structure. Even though the cardinality of (:Tableau:Dashboard) << (:Tableau) the following query is suboptimal

MATCH (n:Tableau:Dashboard) WHERE n.uuid = <UUID> 

compared to either

MATCH (n:Tableau) WHERE n.uuid = <UUID> 
MATCH (n:Dashboard) WHERE n.uuid = <UUID> 

given that the former doesn't leverage any index though the later do. This problem is compounded if one wants to find a node globally based solely on the UUID (which is unique) which is often the case when we're using a Flask API to find nodes, which translates to the following Cypher logic:

MATCH(n) WHERE n.uuid = <UUID>

The following thread suggests creating an Entity overarching global node label and creating an index on that ( Neo4j: Create index for nodes with same property),

CREATE INDEX ON:Entity(uuid)

so now nodes with be labeled as follows,

(:Entity:User)
(:Entity:Tableau:Workbook)
(:Entity:Tableau:Dashboard)

Is this the best approach? Another solution would be to just choose the first label if multiple labels are defined given that it's guaranteed to be indexed, however it doesn't resolve the issue of finding a node solely based on the UUID.

If I go with the Entity label approach does it still make sense to keep all the previously defined indexes, i.e. would I expect to see significant performance improvements if I'm only searching a small subset of the nodes? For example if I knew that n was a (:User) node should I expect to see similar performances with,

MATCH (n:Entity) WHERE n.uuid = <UUID>
MATCH (n:User) WHERE n.uuid = <UUID>

Not having the ability to index on none or multiple indexes is a shame given that optimal Cypher queries may be more abstract, i.e. let's say that a (:Tableau:Workbook) populates a (:Tableau:Dashboard) then to find the dashboards that the workbook populates one would query,

MATCH (s:Tabeau:Workbook)-[:POPULATES]->(t:Tableau:Dashboard)
WHERE s.uuid = <UUID>
RETURN t

which is fairly transparent, however the following would be more optimal from a performance perspective, though less transparent given that it's not obvious to the user what type of node s is,

MATCH (s:Entity)-[:POPULATES]->(t:Tableau:Dashboard)
WHERE s.uuid = <UUID>
RETURN t

Solution

  • You are maintaining overlapping indexes for Tableau and Workbook and Tableau and Dashboard. Why not just maintain the index for Tableau to eliminate redundancy and throw the query planner a hint with USING INDEX to make sure it is used in your match. i.e. something like this...

    MATCH (s:Tableau:Workbook)-[:POPULATES]->(t:Tableau:Dashboard)
    USING INDEX s:Tableau(uuid)
    WHERE s.uuid = <UUID>
    RETURN t