
How to batch process millions of nodes and save the result into file


I have the following schema in Neo4j:

(:Foo)-[:HAS]->(:Bar)
(:Foo)<-[:IN]-(:Baz)

There are tens of millions of (:Foo) nodes, with potentially millions of relationships between (:Foo) and (:Baz).

I want to get the biggest (:Foo) nodes (in terms of the number of relationships between (:Foo) and (:Baz)) that do not have any relationships with (:Bar).

I was trying:

MATCH (f:Foo) WHERE NOT (f)-[:HAS]->(:Bar)
WITH f, count([(f)<-[:IN]-()]) as b_count WHERE b_count > 10 RETURN f, b_count

but that query never finishes.

I have also tried using apoc.periodic.iterate, but I don't know how to get the result.

CALL apoc.periodic.iterate(
"MATCH (f:Foo) WHERE NOT (f)-[:HAS]->(:Bar) RETURN f", 
"WITH f, count([(f)<-[:IN]-()]) as b_count WHERE b_count > 10 RETURN f, b_count",
 {parallel:true, batchSize:1000})

Ideally, I would like to get the results sorted by b_count and return only the N biggest.

Sorting all the results to only get the biggest N might be too memory-demanding. If the results could be saved to a file, I could use sort to order the results afterwards.

EDIT:

If possible, the query should be compatible with Neo4j 3.5.


Solution

  • As mentioned in this answer, COUNT subqueries allow you to take advantage of the very efficient getDegree operation, which counts a node's relationships without expanding them one by one.

    If all HAS relationships from a Foo node end in a Bar node, then you can simplify your first pattern to (f)-[:HAS]->() to take advantage of the getDegree operation twice in the same query:

    MATCH (f:Foo)
    WHERE COUNT { (f)-[:HAS]->() } = 0
    WITH f, COUNT { (f)<-[:IN]-() } AS b_count
    WHERE b_count > 10
    RETURN f, b_count
    

    This query should be very fast.
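
    The question also asks for only the N biggest results. A minimal sketch of that, assuming neo4j 5.x and an arbitrary N of 100 (the value is only an illustration), is to add ORDER BY and LIMIT to the query above. When the two are combined, the planner can normally keep just the current top N rows instead of sorting every match, which should ease the memory concern:

    ```cypher
    // Same filter and degree counts as above; only the final ordering is new.
    MATCH (f:Foo)
    WHERE COUNT { (f)-[:HAS]->() } = 0
    WITH f, COUNT { (f)<-[:IN]-() } AS b_count
    WHERE b_count > 10
    RETURN f, b_count
    ORDER BY b_count DESC
    LIMIT 100  // assumed N; adjust as needed
    ```

    Running the query with PROFILE should show whether the plan uses a Top operator rather than a full Sort.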

    Prior to neo4j 5.0

    If you are using a version of neo4j older than 5.0, you should be able to replace COUNT { ... } with SIZE(...) to use the getDegree operation. Here is a knowledge base article about that.
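
    As a sketch of what that replacement could look like on 3.5 or 4.x, the first statement below is the same query with SIZE(...) over the pattern expressions. The second statement is one possible way to write the results to a file, as the question suggests, using apoc.export.csv.query. It assumes APOC is installed and that file export is enabled (apoc.export.file.enabled=true in the server configuration); the file name foo_counts.csv and the choice to export id(f) rather than the whole node are illustrative assumptions, not part of the original answer:

    ```cypher
    // Pre-5.0 form: SIZE over a pattern expression instead of COUNT { ... }.
    MATCH (f:Foo)
    WHERE size((f)-[:HAS]->()) = 0
    WITH f, size((f)<-[:IN]-()) AS b_count
    WHERE b_count > 10
    RETURN f, b_count;

    // Optional: stream the same result into a CSV file (hypothetical file name)
    // so it can be sorted outside the database afterwards.
    CALL apoc.export.csv.query(
      "MATCH (f:Foo) WHERE size((f)-[:HAS]->()) = 0 WITH f, size((f)<-[:IN]-()) AS b_count WHERE b_count > 10 RETURN id(f) AS foo_id, b_count",
      "foo_counts.csv",
      {}
    )
    YIELD file, rows
    RETURN file, rows;
    ```

    The exported file has one line per matching Foo node, so a command-line sort on the b_count column gives the ordering without holding all results in memory inside the database.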