Search code examples
neo4jcypherquery-optimizationgtfs

Simple Cypher Query for apoc dijkstra taking FOREVER


Maybe I am very stupid or Neo4j is not supposed to be fast. (Disclaimer: I am a Neo4j noob)

I have the following simple dijkstra query which is taking forever to run. I have to atleast wait for 5-10 minutes for it to execute.Sometimes my Chrome browser crashes because of it.

Sample Graph

enter image description here

Cypther Query

profile MATCH  (startNode:Stop)--(st:Stoptime),
(endNode:Stop)--(et:Stoptime)
where  endNode.name = 'Hauptbahnhof Süd' and 
(startNode.name = 'Schlump' or startNode.name = 'U Schlump')     
call apoc.algo.dijkstra(st, et, 'PRECEDES', 'weight') YIELD path, weight
return startNode, endNode, path, weight
limit 100;

Computer Config

I am using a Ubuntu VM on windows machine which has 24GB Ram and 6 Cpus.

Indexes

enter image description here

Sysinfo enter image description here

When I run profile on the above Query, i get the following information:

Profile Information

enter image description here

For the love of God, I cant figure out, where the bottleneck lies. I have checked all other answers on this, but to no avail.


Solution

  • Since I don't have the data set to test out my suggestion with, I can only point you in the direction that I would look. Hopefully, it leads you to the answer.

    In looking at the profile and query I see that startNode and endNode are both type :Stop and that the Stop.name property is indexed.

    When looking for endNode.name = 'Hauptbahnhof Süd' there are 3 estimated rows and 3 rows are returned.

    However when looking for (startNode.name = 'Schlump' or startNode.name = 'U Schlump') there are 6 estimated rows, but 14827 returned.

    Are there indeed 14827 :Stop nodes that contain either 'Schlump' or 'U Schlump'?

    Or is it the 6 estimated rows? If the latter is the case can you run the query without the OR:

    where endNode.name = 'Hauptbahnhof Süd' and startNode.name = 'Schlump'

    to see what the profiler comes up with.

    If that performs as expected then the solution may be to rewrite the query to include that OR logic in a different format?

    Perhaps where endNode.name = 'Hauptbahnhof Süd' and startNode.name IN ['Schlump','U Schlump']

    Also found this older answer indicating an issue with the OR operator and indexes prior to 3.2.

    I had remembered seeing another recent answer about some issue with OR, but can't seem to locate it now.

    Good luck!