I'm trying to understand the proper, scalable way to write Cypher queries.
Right now I have two approaches:
#1 with pure direct MATCH:
MATCH ( childD:Vacancy )
WHERE exists { MATCH (childD)-[:EMPLOYMENT_AS]-(req0:Requirable) WHERE req0.id IN [21, 22, 23, 24]}
AND exists { MATCH (childD)-[:WORK_TIME_ZONE]-(req1:Requirable) WHERE req1.id IN [11, 12, 13, 14]}
AND exists { MATCH (childD)-[:COMPANY_TYPE_OF]-(req2:Requirable) WHERE req2.id IN [1, 2]}
AND exists { MATCH (childD)-[:EMPLOYMENT_TYPE_AS]-(req3:Requirable) WHERE req3.id IN [26, 27]}
AND exists { MATCH (childD)-[:LOCATED_IN]-(req4:Requirable) WHERE req4.id IN [6, 7]}
AND (childD.`active` = true) AND ( (childD.`salaryUsd` >= 15330) OR (childD.`hourlyRateUsd` >= 81) )
WITH childD
RETURN count(childD)
Cypher version: CYPHER 4.4, planner: COST, runtime: INTERPRETED. 231995 total db hits in 205 ms.
#2 the same query wrapped with apoc.cypher.mapParallel2:
MATCH ( childD:Vacancy )
WITH collect({`childD`: childD}) as childDDgRdgd
CALL apoc.cypher.mapParallel2("
WITH _.childD as childD
WHERE exists { MATCH (childD)-[:EMPLOYMENT_AS]-(req0:Requirable) WHERE req0.id IN $reqParam0}
AND exists { MATCH (childD)-[:WORK_TIME_ZONE]-(req1:Requirable) WHERE req1.id IN $reqParam1}
AND exists { MATCH (childD)-[:COMPANY_TYPE_OF]-(req2:Requirable) WHERE req2.id IN $reqParam2}
AND exists { MATCH (childD)-[:EMPLOYMENT_TYPE_AS]-(req3:Requirable) WHERE req3.id IN $reqParam3}
AND exists { MATCH (childD)-[:LOCATED_IN]-(req4:Requirable) WHERE req4.id IN $reqParam4} AND (childD.`active` = $active7)
AND ( (childD.`salaryUsd` >= $salaryUsd5) OR (childD.`hourlyRateUsd` >= $hourlyRateUsd6) )
WITH childD
RETURN childD
", {`hourlyRateUsd6`:81, `reqParam4`:[6, 7], `reqParam3`:[26, 27], `salaryUsd5`:15330, `reqParam2`:[1, 2], `reqParam1`:[11, 12, 13, 14], `reqParam0`:[21, 22, 23, 24], `active7`:true}, childDDgRdgd, 6, 10)
YIELD value as value
WITH value.childD as childD
RETURN count(childD)
Cypher version: CYPHER 4.4, planner: COST, runtime: INTERPRETED. 10001 total db hits in 114 ms
As you can see, the approach with apoc.cypher.mapParallel2 is about twice as fast (114 ms vs. 205 ms), but I'm not sure whether it is a correct and scalable way. Please advise: is it right to wrap the query with apoc.cypher.mapParallel2? What is the correct and scalable way to execute such a query?
Usually, when it comes to scaling queries, remodeling the data is suggested first, before parallelization. But if the model is already as good as it can be, then parallelization is the way to go. Also, apoc.cypher.mapParallel2 is expected to give a performance benefit, since it aims to use all the CPU cores and combines the outputs from multiple threads, as you have seen.
So yes, I think it's a good way of executing the query. However, it can also perform poorly in some cases, as discussed in this Neo4j Community Discussion. That discussion is old, but it is worth thoroughly trying out all of your scenarios before relying on it.
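As a starting point for trying those scenarios, here is a minimal sketch of the plain query from the question, prefixed with PROFILE and using the same parameter names as the mapParallel2 version, so that both variants run with identical inputs. It assumes the client supplies the parameters ($reqParam0 through $reqParam4, $salaryUsd5, $hourlyRateUsd6, $active7) with the values shown in the question; nothing else is new.

```cypher
// Approach #1 from the question, parameterized like approach #2.
// Assumption: the parameters are passed in by the client with the question's values.
PROFILE
MATCH (childD:Vacancy)
WHERE exists { MATCH (childD)-[:EMPLOYMENT_AS]-(req0:Requirable) WHERE req0.id IN $reqParam0 }
  AND exists { MATCH (childD)-[:WORK_TIME_ZONE]-(req1:Requirable) WHERE req1.id IN $reqParam1 }
  AND exists { MATCH (childD)-[:COMPANY_TYPE_OF]-(req2:Requirable) WHERE req2.id IN $reqParam2 }
  AND exists { MATCH (childD)-[:EMPLOYMENT_TYPE_AS]-(req3:Requirable) WHERE req3.id IN $reqParam3 }
  AND exists { MATCH (childD)-[:LOCATED_IN]-(req4:Requirable) WHERE req4.id IN $reqParam4 }
  AND childD.active = $active7
  AND (childD.salaryUsd >= $salaryUsd5 OR childD.hourlyRateUsd >= $hourlyRateUsd6)
RETURN count(childD)
```

Running each variant several times and comparing the later runs should keep a cold cache or first-time planning from skewing a single measurement. The db hit total reported for the mapParallel2 variant may not include the work done inside the procedure, so elapsed time is probably the more comparable figure.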