In my database I have multiple nodes with the same properties, and I want to merge all of these nodes into one. The nodes are connected by the relationship "similar", so I use the GDS Weakly Connected Components algorithm to find these communities:
CALL gds.graph.project("similar_officer",["Officer"],["similar"])
CALL gds.wcc.write('similar_officer', { writeProperty: 'community' }) YIELD nodePropertiesWritten, componentCount;
So every node now has the new property community, which identifies all nodes in a community connected by the relationship "similar".
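As a quick sanity check, I can see the communities and their sizes with something like:
MATCH (n:Officer)
RETURN n.community AS community, count(n) AS members
ORDER BY members DESC
LIMIT 10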
Here is the problem: I want to iterate the command so that all the nodes in each community are merged into one.
I tested with a single community with this code:
MATCH (a)
WHERE a.community = 235631
WITH collect(a) AS community
CALL apoc.refactor.mergeNodes(community, {properties:"discard", mergeRels:true, preserveExistingSelfRels:false})
YIELD node
RETURN count(*)
It worked in less than a minute, and that was the biggest community, so I wrote this to iterate over every community:
call apoc.periodic.iterate('
MATCH (n:Officer)
with n.community as numerocom, collect(n) as nodicom, size(collect(n)) as dimcom
return numerocom, nodicom
','
CALL apoc.refactor.mergeNodes(nodicom,{
properties:"discard", mergeRels:true, preserveExistingSelfRels:false}) YIELD node
return node
',{batchSize:10000, parallel:True})
YIELD batches, total
RETURN batches, total
The result is a never-ending query. I have 13428 communities and the biggest one has 60 nodes.
How can I fix this?
There are some issues with your query.
The main problem is likely that your outer statement returns one row per community, so with batchSize: 10000 each transaction attempts to merge up to 10000 communities (and all of their officers) at once, which may be far too much data for a single transaction; with 13428 communities, the first batch alone covers most of your graph.
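As a quick check on the batch math, you can count the communities to see how many rows the outer statement will emit (each row is one community):
MATCH (n:Officer)
RETURN count(DISTINCT n.community) AS communities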
The other issue is that you are COLLECTing the nodes twice per community. The second time, you only use the collection to count its size, and in a very inefficient way: instead of SIZE(COLLECT(n)) you could have just used COUNT(n). Even worse, you immediately throw away the count because you never return it. So you should eliminate the counting entirely.
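For illustration only (the corrected query below drops the count entirely), the cheap way to get each community's size alongside its collection would be:
MATCH (n:Officer)
WITH n.community AS numerocom, collect(n) AS nodicom, count(n) AS dimcom
RETURN numerocom, nodicom, dimcom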
This query, which addresses the above issues, may work better for you. You will have to determine for yourself the best batchSize for your data.
CALL apoc.periodic.iterate('
MATCH (n:Officer)
WITH n.community AS com, collect(n) AS nodicom
RETURN nodicom
','
CALL apoc.refactor.mergeNodes(nodicom,{
properties:"discard", mergeRels:true, preserveExistingSelfRels:false}) YIELD node
RETURN node
',{batchSize: 200, parallel: true})
YIELD batches, total
RETURN batches, total
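One caveat: because mergeRels: true rewrites relationships, two communities that are connected to each other can contend for the same locks when run with parallel: true; if you hit deadlock errors, rerun with parallel: false. Once the procedure finishes, you can verify the merge and, assuming you no longer need it, drop the in-memory projection:
// Any community that still has more than one node was not fully merged
MATCH (n:Officer)
WITH n.community AS com, count(n) AS members
WHERE members > 1
RETURN com, members;
// The projected graph still references the pre-merge nodes, so drop it
CALL gds.graph.drop('similar_officer');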