Search code examples
neo4jcypherquery-optimization

Neo4j query performance for node/relationship approach versus property based filtering


I have two Neo4j Cypher queries whcih return the same result:

#1 Based purely on property filtering approach:

MATCH (dg:DecisionGroup {id: -3})-[rdgd:CONTAINS]->(childD:Vacancy )  
WHERE  any(id IN childD.`employmentIds`  WHERE id IN [16])  
AND  any(id IN childD.`timeZoneIds`  WHERE id IN [11])  
AND  any(id IN childD.`companyTypeIds`  WHERE id IN [1])  
AND  (childD.`active` = true)  
AND ( (childD.`hourlyRateUsd` >= 120)  OR  (childD.`salaryUsd` >= 13691) )
AND  any(id IN childD.`locationIds`  WHERE id IN [6, 7, 8, 9, 10])  
AND  any(id IN childD.`employmentTypeIds`  WHERE id IN [21, 22, 23, 24])  
WITH childD  
RETURN count(childD)

Cypher version: CYPHER 4.4, planner: COST, runtime: INTERPRETED. 197913 total db hits in 198 ms.

enter image description here

#2 Mostly Node/Relationship matching with few property filters:

MATCH (dg:DecisionGroup {id: -3})-[rdgd:CONTAINS]->(childD:Vacancy )  
MATCH (childD)-[:WORK_TIME_ZONE]-(req4:TimeZone:Requirable) 
WHERE req4.id IN [11] 
WITH DISTINCT childD  
MATCH (childD)-[:EMPLOYMENT_AS]-(req5:Employment:Requirable) 
WHERE req5.id IN [16] 
WITH DISTINCT childD  
MATCH (childD)-[:EMPLOYMENT_TYPE_AS]-(req6:EmploymentType:Requirable) 
WHERE req6.id IN [21, 22, 23, 24] 
WITH DISTINCT childD  
MATCH (childD)-[:COMPANY_TYPE_OF]-(req7:CompanyType:Requirable) 
WHERE req7.id IN [1] 
WITH DISTINCT childD  
MATCH (childD)-[:LOCATED_IN]-(req8:Location:Requirable) 
WHERE req8.id IN [6, 7, 8, 9, 10] 
WITH DISTINCT childD  
WHERE  (childD.`active` = true)  
AND ( (childD.`salaryUsd` >= 13691)  OR  (childD.`hourlyRateUsd` >= 120) ) 
WITH childD  
RETURN count(childD)

Cypher version: CYPHER 4.4, planner: COST, runtime: INTERPRETED. 2559218 total db hits in 1178 ms.

enter image description here

As you may see, the approach #2 based on Nodes and Relationships took 2559218 DB hits versus 197913 DB hits for the node property filtering approach. I like the speed of approach #1, but I feel that the approach #1 is not scalable… This is not the Neo4j graph way. What am I doing wrong within the approach #2 and how to improve it? How to get the #1 performance and Neo4j scalability feature there? Now it is only 20k nodes. What if I have 100k or 1000k ? How to get #1 performance and Neo4j scalability feature? What approach should I select? P.S

In order to be safe - in my Neo4j schema I support both - nodes/relationships and aggregated properties. But I'm definitely doing something wrong with Nodes/Relationships.. please help to solve it! Thanks!

UPDATED

Thanks to the answer below, I updated the query to the following one:

MATCH (dg:DecisionGroup {id: -3})-[rdgd:CONTAINS]->(childD:Jobable )
WHERE  exists{ 
MATCH (childD)-[:WORK_TIME_ZONE]-(req4:Requirable) 
WHERE req4.id IN [11]} 
AND exists{ 
MATCH (childD)-[:EMPLOYMENT_AS]-(req5:Requirable) 
WHERE req5.id IN [16]}
AND exists{ 
MATCH (childD)-[:EMPLOYMENT_TYPE_AS]-(req6:Requirable) 
WHERE req6.id IN [21, 22, 23, 24]}
AND exists{ 
MATCH (childD)-[:COMPANY_TYPE_OF]-(req7:Requirable) 
WHERE req7.id IN [1]}
AND exists{ 
MATCH (childD)-[:LOCATED_IN]-(req8:Requirable) 
WHERE req8.id IN [6, 7, 8, 9, 10] 
}
WITH childD
WHERE (childD.`active` = true)  
AND ( (childD.`salaryUsd` >= 13691)  OR  (childD.`hourlyRateUsd` >= 120)) 
RETURN count(childD)

Not it works much more better:

Cypher version: CYPHER 4.4, planner: COST, runtime: INTERPRETED. 691112 total db hits in 675 ms.

enter image description here


Solution

  • There are a couple of improvements you can make to improve your query. First of all, I would use the node filters first. If you had an index on the active property that might also help. Second of all, instead of doing a bunch of MATCH clauses and then deduplicating you can use existential subqueries.

    MATCH (dg:DecisionGroup {id: -3})-[rdgd:CONTAINS]->(childD:Vacancy )
    WHERE  (childD.`active` = true)  
    AND ( (childD.`salaryUsd` >= 13691)  OR  (childD.`hourlyRateUsd` >= 120)) 
    AND exists{ 
    MATCH (childD)-[:WORK_TIME_ZONE]-(req4:TimeZone:Requirable) 
    WHERE req4.id IN [11]} 
    AND exists{ 
    MATCH (childD)-[:EMPLOYMENT_AS]-(req5:Employment:Requirable) 
    WHERE req5.id IN [16]}
    AND exists{ 
    MATCH (childD)-[:EMPLOYMENT_TYPE_AS]-(req6:EmploymentType:Requirable) 
    WHERE req6.id IN [21, 22, 23, 24]}
    AND exists{ 
    MATCH (childD)-[:COMPANY_TYPE_OF]-(req7:CompanyType:Requirable) 
    WHERE req7.id IN [1]}
    AND exists{ 
    MATCH (childD)-[:LOCATED_IN]-(req8:Location:Requirable) 
    WHERE req8.id IN [6, 7, 8, 9, 10] 
    }
    RETURN count(childD)
    

    Again, you might omit some node labels in the existential subqueries to further improve your queries.