Search code examples
neo4jcypherquerying

Why does the query to find intermediate nodes take so long?


The database has a graph with the following 3 nodes:

...->(1) ------>(3)-->...
   \             ^
    \            |
     ---->(2)---/

Now, I want to get all distinct nodes that are reachable from node 1 to node 3, including themselves where I know exactly unique properties of node 1 and node 3 (the nodes are actually commits from a github repository). So, I came up with the following query:

MATCH (origin:App)
WHERE origin.commit='10cb31b0a72525923c01dc34f8690f311a361d42'
MATCH (destination:App)
WHERE destination.commit='51fde433973463f057ffcbcbab0bc8944ab3ec9c'
MATCH (origin)-[:CHANGED_TO*0..]->(intermediate_commit:App)-[:CHANGED_TO*0..]->(destination)
RETURN distinct intermediate_commit

However, the query never finishes or at least takes too long to complete. I know that I could have used MATCH p=(origin:App)-[:CHANGED_TO*0..]->(destination:App) and then UNWIND and return distinct nodes. The problem is, I believe, it queries different paths implying I am interested in relationships between them too. While in fact I am not interested in paths. What I need is only distinct nodes that match the pattern. My understanding is that querying paths is slower than it could be if I could query only the nodes.

Could you please help to understand what I am missing? Thanks!


Solution

  • The solution was quite simple. Instead of specifying a pattern in MATCH clause, we move the pattern to WHERE clause. Also, I split the pattern into 2 parts. I can't explain why exactly it is faster but my understanding is that when we move pattern to WHERE clause and MATCH only nodes, we let neo4j know that we are interested only in nodes and not in all possible paths that match the pattern.

    The full query:

    MATCH (origin:App)
      WHERE origin.commit='10cb31b0a72525923c01dc34f8690f311a361d42'
    MATCH (destination:App)
      WHERE destination.commit='51fde433973463f057ffcbcbab0bc8944ab3ec9c'
    MATCH (intermediate_commit:App)
      WHERE (origin)-[:CHANGED_TO*0..]->(intermediate_commit)
        AND (intermediate_commit)-[:CHANGED_TO*0..]->(destination)
    RETURN distinct intermediate_commit
    

    Also, if you have a lot of nodes, I believe specifying LIMIT 1 to match origin and destination can also improve the query, like this:

    MATCH (origin:App)
      WHERE origin.commit='10cb31b0a72525923c01dc34f8690f311a361d42'
    WITH origin
    LIMIT 1
    MATCH (destination:App)
      WHERE destination.commit='51fde433973463f057ffcbcbab0bc8944ab3ec9c'
    WITH origin, destination
    LIMIT 1
    MATCH (intermediate_commit:App)
      WHERE (origin)-[:CHANGED_TO*0..]->(intermediate_commit)
        AND (intermediate_commit)-[:CHANGED_TO*0..]->(destination)
    RETURN distinct intermediate_commit