I have the following schema in Neo4j:
(:Foo)-[:HAS]->(:Bar)
(:Foo)<-[:IN]-(:Baz)
There are tens of millions of (:Foo) nodes, with potentially millions of relationships between (:Foo) and (:Baz).
I want to get the biggest (:Foo) nodes (in terms of the number of relationships between (:Foo) and (:Baz)) which do not have any relationships with (:Bar).
I tried:
MATCH (f:Foo) WHERE NOT (f)-[:HAS]->(:Bar)
WITH f, count([(f)<-[:IN]-()]) as b_count WHERE b_count > 10 RETURN f, b_count
but that query never finishes.
I have also tried using apoc.periodic.iterate, but I don't know how to get the result.
CALL apoc.periodic.iterate(
"MATCH (f:Foo) WHERE NOT (f)-[:HAS]->(:Bar) RETURN f",
"WITH f, count([(f)<-[:IN]-()]) as b_count WHERE b_count > 10 RETURN f, b_count",
{parallel:true, batchSize:1000})
Ideally, I would like to get the results sorted by b_count and return only the N biggest.
Sorting all the results to only get the biggest N might be too memory-demanding.
If the results could be saved to a file, I could use the sort command to order them afterwards.
EDIT:
If possible, the query should be compatible with Neo4j 3.5.
As mentioned in this answer, COUNT subqueries allow you to take advantage of the very efficient getDegree operation (by avoiding any DB hits).
If all HAS relationships from a Foo node end in a Bar node, then you can simplify your first pattern to (f)-[:HAS]->() to take advantage of the getDegree operation twice in the same query:
MATCH (f:Foo)
WHERE COUNT { (f)-[:HAS]->() } = 0
WITH f, COUNT { (f)<-[:IN]-() } AS b_count
WHERE b_count > 10
RETURN f, b_count
This query should be very fast.
If you are using a version of Neo4j older than 5.0, you should be able to replace COUNT { ... } with SIZE(...), for example SIZE((f)<-[:IN]-()), to use the getDegree operation. Here is a knowledge base article about that.
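For the Neo4j 3.5 case mentioned in the question, the same idea can be sketched with SIZE(...) plus ORDER BY and LIMIT to return only the N biggest nodes. This is an untested sketch: it still assumes that every HAS relationship from a Foo node ends in a Bar node, and the limit of 100 is just a placeholder for N.

```cypher
// Neo4j 3.5-style sketch: SIZE() over a pattern expression instead of COUNT { ... }
MATCH (f:Foo)
// keep only Foo nodes with no outgoing HAS relationship (assumed to always point to a Bar)
WHERE SIZE((f)-[:HAS]->()) = 0
// incoming IN relationships per Foo node
WITH f, SIZE((f)<-[:IN]-()) AS b_count
WHERE b_count > 10
RETURN f, b_count
ORDER BY b_count DESC
LIMIT 100  // placeholder for N
```

When ORDER BY is followed directly by LIMIT, the planner can usually use a Top operator that keeps only the best N rows rather than sorting the whole result, which should ease the memory concern. Prefixing the query with PROFILE is a way to check whether the plan actually shows GetDegree and Top.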