Collect is very slow in Neo4j


How can I roll up pairs of IDs with the labels connecting them, using collect(), on a graph of 2.2B nodes? I would like to roll up the flat list of uid pairs, together with the labels connecting them, without duplicates. My graph in Neo4j is composed of 10 types of IDs: 9 connecting IDs and 1 first-party ID.

I am trying to write a query that, for each pair of first-party IDs connected by one or more connecting IDs, returns a list of the third-party IDs connecting them.

Right now I have a query as follows:

MATCH (u:User)-[]->(id)
MATCH (id)<-[]-(u2:User)
WHERE u <> u2 AND ID(u) < ID(u2)
RETURN u.uid, u2.uid, labels(id)
LIMIT 100

which returns a list of u, u2, labels that looks like

u|u2|labels
uid1|uid2|["label1"] 
uid2|uid3|["label2"]
uid1|uid2|["label2"]

What I would like to do is roll up the lists into a collection with something like

MATCH (u:User)-[]->(id)
MATCH (id)<-[]-(u2:User)
WHERE u <> u2 AND ID(u) < ID(u2)
RETURN u.uid, u2.uid, collect(labels(id))
LIMIT 100

but it is extremely slow and freezes my browser. I am working with a 163 GB data set on a 244 GB EC2 instance, configured with

dbms.memory.heap.initial_size=150g
dbms.memory.heap.max_size=150g
dbms.memory.pagecache.size=60g

Solution

  • The problem is that collect() is an aggregation, so all matching rows have to be materialized before any result can be returned. The LIMIT is then applied only as a filter at the very end, which isn't going to work with the size of your db.

    Since your limit should really apply only to distinct node pairs matching that pattern (regardless of how many common nodes there are between them), it's best to move the LIMIT up, and to find the common nodes between them (and their labels) only once you're working with the limited set of 100 pairs.

    Give this a try:

    MATCH (u:User)-->()<--(u2:User)
    WHERE ID(u) < ID(u2)
    WITH DISTINCT u, u2
    LIMIT 100
    RETURN u, u2, [(u)-->(id)<--(u2) | labels(id)] as idLabels
    

    We're using a pattern comprehension at the end, but you could just as easily use a MATCH and collect() there instead.
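
    As a minimal sketch of that MATCH and collect() alternative (not run against your data), the query below keeps the early LIMIT and aggregates only over the 100 limited pairs. The UNWIND and collect(DISTINCT ...) steps are my own addition, on the assumption that you want one flat, de-duplicated list of labels per pair rather than a list of label lists:

    // Limit to 100 distinct user pairs before doing any aggregation
    MATCH (u:User)-->()<--(u2:User)
    WHERE ID(u) < ID(u2)
    WITH DISTINCT u, u2
    LIMIT 100
    // Re-match the connecting nodes for just those pairs
    MATCH (u)-->(id)<--(u2)
    // Flatten each node's label list (assumed to be the desired output shape)
    UNWIND labels(id) AS label
    RETURN u.uid, u2.uid, collect(DISTINCT label) AS idLabels

    One caveat with this shape: UNWIND produces no rows for an empty list, so a pair whose connecting nodes all have no labels would drop out of the result.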