Tags: performance, graph, neo4j, traversal

Neo4j performance with cycles


I have a relatively large Neo4j graph with 7 million nodes and 5 million relationships.

When I try to find the subtree size for one node, Neo4j gets stuck traversing 600,000 nodes, only 130 of which are unique. This happens because of cycles. It looks like DISTINCT is applied only after the whole graph has been traversed to the maximum depth.

Is it possible to change this behaviour somehow?

The query is:

match (a1)-[o1*1..]->(a2) WHERE a1.id = '123' RETURN distinct a2

Solution

  • You can iteratively step through the subgraph a "layer" at a time while avoiding reprocessing the same node multiple times, by using the APOC procedure apoc.periodic.commit. That procedure iteratively processes a query until it returns 0.

    Here is an example of this technique. It:

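    • Uses a temporary TempNode node to keep track of a couple of important values between iterations, one of which will eventually contain the distinct ids of the nodes in the subgraph. As written, Step 1 seeds that list with the "root" node's id, so the root is included; your question's query returns the root only if a cycle leads back to it, so filter it out in Step 3 if you need the same result.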
    • Assumes that all the nodes you care about share the same label, Foo, and that you have an index on Foo(id). This is for speeding up the MATCH operations, and is not strictly necessary.

    Step 1: Create the TempNode (using MERGE, to reuse the existing node, if any)

    WITH '123' AS rootId
    MERGE (temp:TempNode)
    SET temp.allIds = [rootId], temp.layerIds = [rootId];
    

    Step 2: Perform iterations (to get all subgraph nodes)

    CALL apoc.periodic.commit("
      MATCH (temp:TempNode)
      UNWIND temp.layerIds AS id
      MATCH (n:Foo) WHERE n.id = id
      OPTIONAL MATCH (n)-->(next)
      WHERE NOT next.id IN temp.allIds
      WITH temp, COLLECT(DISTINCT next.id) AS layerIds
      SET temp.allIds = temp.allIds + layerIds, temp.layerIds = layerIds
      RETURN SIZE(layerIds);
    ");
    

    Step 3: Use subgraph ids

    MATCH (temp:TempNode)
    // ... use temp.allIds, which contains the distinct ids in the subgraph ...
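
For intuition, the three steps above amount to a breadth-first traversal that remembers which ids it has already seen. The following is a minimal in-memory Python sketch of that idea, not part of the original answer. The `adjacency` dictionary and the `subgraph_ids` function are hypothetical stand-ins for the graph and for the `apoc.periodic.commit` loop; `all_ids` and `layer_ids` mirror `temp.allIds` and `temp.layerIds`.

```python
def subgraph_ids(adjacency, root_id):
    """Return the ids reachable from root_id (root included), in discovery order.

    adjacency is a hypothetical dict mapping a node id to the ids it points to.
    """
    seen = {root_id}        # fast membership test for ids already collected
    all_ids = [root_id]     # mirrors temp.allIds (seeded with the root, as in Step 1)
    layer_ids = [root_id]   # mirrors temp.layerIds

    # Each pass corresponds to one iteration of the query in Step 2.
    while layer_ids:
        next_layer = []
        for node_id in layer_ids:
            for neighbor in adjacency.get(node_id, ()):
                # Skip ids collected earlier, so cycles are never re-expanded.
                if neighbor not in seen:
                    seen.add(neighbor)
                    all_ids.append(neighbor)
                    next_layer.append(neighbor)
        # An empty layer plays the role of the query returning 0.
        layer_ids = next_layer
    return all_ids


# Toy graph with cycles: 123 -> a -> 123 and a <-> b.
adjacency = {"123": ["a", "b"], "a": ["b", "123"], "b": ["a", "c"], "c": []}
print(subgraph_ids(adjacency, "123"))  # ['123', 'a', 'b', 'c']
```

Because each id enters `all_ids` at most once, the work is bounded by the number of distinct nodes and their outgoing relationships, rather than by the number of paths through the cycles.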