neo4j cypher query-optimization

Neo4j direct MATCH vs CALL apoc.cypher.mapParallel2


I'm trying to understand the most appropriate and scalable way to write Cypher queries.

Right now I have two approaches:

#1 with pure direct MATCH:

MATCH  ( childD:Vacancy )  
WHERE  exists { MATCH (childD)-[:EMPLOYMENT_AS]-(req0:Requirable) WHERE req0.id IN [21, 22, 23, 24]}  
AND  exists { MATCH (childD)-[:WORK_TIME_ZONE]-(req1:Requirable) WHERE req1.id IN [11, 12, 13, 14]}  
AND  exists { MATCH (childD)-[:COMPANY_TYPE_OF]-(req2:Requirable) WHERE req2.id IN [1, 2]}  
AND  exists { MATCH (childD)-[:EMPLOYMENT_TYPE_AS]-(req3:Requirable) WHERE req3.id IN [26, 27]}  
AND  exists { MATCH (childD)-[:LOCATED_IN]-(req4:Requirable) WHERE req4.id IN [6, 7]}  
AND  (childD.`active` = true)  AND ( (childD.`salaryUsd` >= 15330)  OR  (childD.`hourlyRateUsd` >= 81) ) 
WITH childD 
RETURN count(childD)

Cypher version: CYPHER 4.4, planner: COST, runtime: INTERPRETED. 231995 total db hits in 205 ms.

#2 the same query wrapped with apoc.cypher.mapParallel2

MATCH  ( childD:Vacancy )  
WITH collect({`childD`: childD}) as childDDgRdgd  
CALL apoc.cypher.mapParallel2(" 
WITH _.childD as childD 
WHERE  exists { MATCH (childD)-[:EMPLOYMENT_AS]-(req0:Requirable) WHERE req0.id IN $reqParam0}  
AND  exists { MATCH (childD)-[:WORK_TIME_ZONE]-(req1:Requirable) WHERE req1.id IN $reqParam1}  
AND  exists { MATCH (childD)-[:COMPANY_TYPE_OF]-(req2:Requirable) WHERE req2.id IN $reqParam2}  
AND  exists { MATCH (childD)-[:EMPLOYMENT_TYPE_AS]-(req3:Requirable) WHERE req3.id IN $reqParam3}  
AND  exists { MATCH (childD)-[:LOCATED_IN]-(req4:Requirable) WHERE req4.id IN $reqParam4}  AND  (childD.`active` = $active7)  
AND ( (childD.`salaryUsd` >= $salaryUsd5)  OR  (childD.`hourlyRateUsd` >= $hourlyRateUsd6) ) 
WITH childD 
RETURN childD  
", {`hourlyRateUsd6`:81, `reqParam4`:[6, 7], `reqParam3`:[26, 27], `salaryUsd5`:15330, `reqParam2`:[1, 2], `reqParam1`:[11, 12, 13, 14], `reqParam0`:[21, 22, 23, 24], `active7`:true}, childDDgRdgd, 6, 10) 
YIELD value as value   
WITH value.childD as childD  
RETURN count(childD)

Cypher version: CYPHER 4.4, planner: COST, runtime: INTERPRETED. 10001 total db hits in 114 ms

As you can see, the approach with apoc.cypher.mapParallel2 is about twice as fast, but I'm not sure whether it is a correct and scalable way to do this. Please advise: am I doing the right thing by wrapping the query in apoc.cypher.mapParallel2? What is the correct and scalable way to execute such a query?


Solution

  • When it comes to scaling queries, remodeling the data is usually suggested before parallelization. But if the model is already as good as it can be, then parallelization is the way to go. apoc.cypher.mapParallel2 is expected to give a performance benefit, because it aims to use all the CPU cores and combines the output of multiple threads, which matches what you have seen.

    So yes, I think it's a good way of executing the query. However, it can also perform worse in some cases, as discussed in this Neo4j Community Discussion. That thread is old, but it is still worth testing all of your scenarios thoroughly before relying on it.
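
    One thing that may help when testing: the two profiles in the question are not directly comparable. Query #1 uses literal values while query #2 uses parameters, and, as far as I know, PROFILE does not count the db hits of the Cypher string executed inside an APOC procedure, so the 10001 figure probably understates the work done by query #2. Query #2 also collects every Vacancy node into a single list before the procedure starts, which costs memory as the data grows. A minimal sketch of a like-for-like baseline is query #1 with the same parameters as query #2. The parameter names are taken from the question; supplying them through the driver or client is assumed.

```cypher
// Sketch only: query #1 from the question, with literals replaced by the
// parameters used in query #2. $reqParam0..$reqParam4, $salaryUsd5,
// $hourlyRateUsd6 and $active7 must be supplied by the client.
PROFILE
MATCH (childD:Vacancy)
// Property predicates are listed first for readability only;
// the planner decides the actual evaluation order.
WHERE childD.active = $active7
  AND (childD.salaryUsd >= $salaryUsd5 OR childD.hourlyRateUsd >= $hourlyRateUsd6)
  // Relationship predicates, unchanged from the question
  AND exists { MATCH (childD)-[:EMPLOYMENT_AS]-(req0:Requirable) WHERE req0.id IN $reqParam0 }
  AND exists { MATCH (childD)-[:WORK_TIME_ZONE]-(req1:Requirable) WHERE req1.id IN $reqParam1 }
  AND exists { MATCH (childD)-[:COMPANY_TYPE_OF]-(req2:Requirable) WHERE req2.id IN $reqParam2 }
  AND exists { MATCH (childD)-[:EMPLOYMENT_TYPE_AS]-(req3:Requirable) WHERE req3.id IN $reqParam3 }
  AND exists { MATCH (childD)-[:LOCATED_IN]-(req4:Requirable) WHERE req4.id IN $reqParam4 }
RETURN count(childD)
```

    Running both variants several times against a warm cache and comparing wall-clock time, rather than db hits, should give a fairer picture of whether the parallel version is really worth it for your data.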