I'm using Neo4j to represent our data warehouse. We have ~ 100,000 nodes of various types (~ 10) some of which have multiple labels. A subset of the typical node types are:
(:User)
(:Tableau:Workbook)
(:Tableau:Dashboard)
Here Tableau
represents the data visualization software and Workbook
and Dashboard
are different Tableau entities. The reason we went with multiple labels rather than a single uniquely defined label was one may want to match all (:Tableau)
nodes or all (:Dashboard)
nodes (we have multiple dashboard sources).
I'm also using the GraphAware Neo4j UUID library (https://github.com/graphaware/neo4j-uuid) to ensure that each node (regardless of type) is uniquely identified via the uuid
node property.
I've created an index (and uniqueness constraint) for every node label to improve performance, i.e.
CREATE INDEX ON:User(uuid)
CREATE INDEX ON:Tableau(uuid)
CREATE INDEX ON:Workbook(uuid)
CREATE INDEX ON:Dashboard(uuid)
given that CREATE INDEX
must take exactly one label.
I've run into a few performance issues when matching nodes using Cypher given this indexing structure. Even though the cardinality of (:Tableau:Dashboard)
<< (:Tableau)
the following query is suboptimal
MATCH (n:Tableau:Dashboard) WHERE n.uuid = <UUID>
compared to either
MATCH (n:Tableau) WHERE n.uuid = <UUID>
MATCH (n:Dashboard) WHERE n.uuid = <UUID>
given that the former doesn't leverage any index though the later do. This problem is compounded if one wants to find a node globally based solely on the UUID (which is unique) which is often the case when we're using a Flask API to find nodes, which translates to the following Cypher logic:
MATCH(n) WHERE n.uuid = <UUID>
The following thread suggests creating an Entity
overarching global node label and creating an index on that ( Neo4j: Create index for nodes with same property),
CREATE INDEX ON:Entity(uuid)
so now nodes with be labeled as follows,
(:Entity:User)
(:Entity:Tableau:Workbook)
(:Entity:Tableau:Dashboard)
Is this the best approach? Another solution would be to just choose the first label if multiple labels are defined given that it's guaranteed to be indexed, however it doesn't resolve the issue of finding a node solely based on the UUID.
If I go with the Entity
label approach does it still make sense to keep all the previously defined indexes, i.e. would I expect to see significant performance improvements if I'm only searching a small subset of the nodes? For example if I knew that n
was a (:User)
node should I expect to see similar performances with,
MATCH (n:Entity) WHERE n.uuid = <UUID>
MATCH (n:User) WHERE n.uuid = <UUID>
Not having the ability to index on none or multiple indexes is a shame given that optimal Cypher queries may be more abstract, i.e. let's say that a (:Tableau:Workbook)
populates a (:Tableau:Dashboard)
then to find the dashboards that the workbook populates one would query,
MATCH (s:Tabeau:Workbook)-[:POPULATES]->(t:Tableau:Dashboard)
WHERE s.uuid = <UUID>
RETURN t
which is fairly transparent, however the following would be more optimal from a performance perspective, though less transparent given that it's not obvious to the user what type of node s
is,
MATCH (s:Entity)-[:POPULATES]->(t:Tableau:Dashboard)
WHERE s.uuid = <UUID>
RETURN t
You are maintaining overlapping indexes for Tableau
and Workbook
and Tableau
and Dashboard
. Why not just maintain the index for Tableau
to eliminate redundancy and throw the query planner a hint with USING INDEX
to make sure it is used in your match. i.e. something like this...
MATCH (s:Tableau:Workbook)-[:POPULATES]->(t:Tableau:Dashboard)
USING INDEX s:Tableau(uuid)
WHERE s.uuid = <UUID>
RETURN t