I have two Neo4j Cypher queries whcih return the same result:
#1 Based purely on property filtering approach:
MATCH (dg:DecisionGroup {id: -3})-[rdgd:CONTAINS]->(childD:Vacancy )
WHERE any(id IN childD.`employmentIds` WHERE id IN [16])
AND any(id IN childD.`timeZoneIds` WHERE id IN [11])
AND any(id IN childD.`companyTypeIds` WHERE id IN [1])
AND (childD.`active` = true)
AND ( (childD.`hourlyRateUsd` >= 120) OR (childD.`salaryUsd` >= 13691) )
AND any(id IN childD.`locationIds` WHERE id IN [6, 7, 8, 9, 10])
AND any(id IN childD.`employmentTypeIds` WHERE id IN [21, 22, 23, 24])
WITH childD
RETURN count(childD)
Cypher version: CYPHER 4.4, planner: COST, runtime: INTERPRETED. 197913 total db hits in 198 ms.
#2 Mostly Node/Relationship matching with few property filters:
MATCH (dg:DecisionGroup {id: -3})-[rdgd:CONTAINS]->(childD:Vacancy )
MATCH (childD)-[:WORK_TIME_ZONE]-(req4:TimeZone:Requirable)
WHERE req4.id IN [11]
WITH DISTINCT childD
MATCH (childD)-[:EMPLOYMENT_AS]-(req5:Employment:Requirable)
WHERE req5.id IN [16]
WITH DISTINCT childD
MATCH (childD)-[:EMPLOYMENT_TYPE_AS]-(req6:EmploymentType:Requirable)
WHERE req6.id IN [21, 22, 23, 24]
WITH DISTINCT childD
MATCH (childD)-[:COMPANY_TYPE_OF]-(req7:CompanyType:Requirable)
WHERE req7.id IN [1]
WITH DISTINCT childD
MATCH (childD)-[:LOCATED_IN]-(req8:Location:Requirable)
WHERE req8.id IN [6, 7, 8, 9, 10]
WITH DISTINCT childD
WHERE (childD.`active` = true)
AND ( (childD.`salaryUsd` >= 13691) OR (childD.`hourlyRateUsd` >= 120) )
WITH childD
RETURN count(childD)
Cypher version: CYPHER 4.4, planner: COST, runtime: INTERPRETED. 2559218 total db hits in 1178 ms.
As you may see, the approach #2 based on Nodes and Relationships took 2559218
DB hits versus 197913
DB hits for the node property filtering approach. I like the speed of approach #1, but I feel that the approach #1 is not scalable… This is not the Neo4j graph way. What am I doing wrong within the approach #2 and how to improve it? How to get the #1 performance and Neo4j scalability feature there? Now it is only 20k nodes. What if I have 100k or 1000k ? How to get #1 performance and Neo4j scalability feature? What approach should I select?
P.S
In order to be safe - in my Neo4j schema I support both - nodes/relationships and aggregated properties. But I'm definitely doing something wrong with Nodes/Relationships.. please help to solve it! Thanks!
UPDATED
Thanks to the answer below, I updated the query to the following one:
MATCH (dg:DecisionGroup {id: -3})-[rdgd:CONTAINS]->(childD:Jobable )
WHERE exists{
MATCH (childD)-[:WORK_TIME_ZONE]-(req4:Requirable)
WHERE req4.id IN [11]}
AND exists{
MATCH (childD)-[:EMPLOYMENT_AS]-(req5:Requirable)
WHERE req5.id IN [16]}
AND exists{
MATCH (childD)-[:EMPLOYMENT_TYPE_AS]-(req6:Requirable)
WHERE req6.id IN [21, 22, 23, 24]}
AND exists{
MATCH (childD)-[:COMPANY_TYPE_OF]-(req7:Requirable)
WHERE req7.id IN [1]}
AND exists{
MATCH (childD)-[:LOCATED_IN]-(req8:Requirable)
WHERE req8.id IN [6, 7, 8, 9, 10]
}
WITH childD
WHERE (childD.`active` = true)
AND ( (childD.`salaryUsd` >= 13691) OR (childD.`hourlyRateUsd` >= 120))
RETURN count(childD)
Not it works much more better:
Cypher version: CYPHER 4.4, planner: COST, runtime: INTERPRETED. 691112 total db hits in 675 ms.
There are a couple of improvements you can make to improve your query. First of all, I would use the node filters first. If you had an index on the active
property that might also help. Second of all, instead of doing a bunch of MATCH
clauses and then deduplicating you can use existential subqueries.
MATCH (dg:DecisionGroup {id: -3})-[rdgd:CONTAINS]->(childD:Vacancy )
WHERE (childD.`active` = true)
AND ( (childD.`salaryUsd` >= 13691) OR (childD.`hourlyRateUsd` >= 120))
AND exists{
MATCH (childD)-[:WORK_TIME_ZONE]-(req4:TimeZone:Requirable)
WHERE req4.id IN [11]}
AND exists{
MATCH (childD)-[:EMPLOYMENT_AS]-(req5:Employment:Requirable)
WHERE req5.id IN [16]}
AND exists{
MATCH (childD)-[:EMPLOYMENT_TYPE_AS]-(req6:EmploymentType:Requirable)
WHERE req6.id IN [21, 22, 23, 24]}
AND exists{
MATCH (childD)-[:COMPANY_TYPE_OF]-(req7:CompanyType:Requirable)
WHERE req7.id IN [1]}
AND exists{
MATCH (childD)-[:LOCATED_IN]-(req8:Location:Requirable)
WHERE req8.id IN [6, 7, 8, 9, 10]
}
RETURN count(childD)
Again, you might omit some node labels in the existential subqueries to further improve your queries.