I want to calculate betweenness centrality on a very large graph in Neo4j using py2neo.
I am using a Cypher query like this:
MATCH p=allShortestPaths((source:DOLPHIN)-[*]-(target:DOLPHIN))
WHERE id(source) < id(target)
AND length(p) > 1
UNWIND nodes(p)[1..-1] as n
RETURN n.name, count(*) as betweenness
ORDER BY betweenness DESC
It works for a small graph, but not for a large graph with 1 million nodes. I run this query through py2neo.
Earlier I was getting a timeout error, which I have resolved, but now, after running for some time, the query fails with the following error:
File "/usr/local/lib/python2.7/dist-packages/py2neo/cypher/core.py", line 111, in execute
results = tx.commit()
File "/usr/local/lib/python2.7/dist-packages/py2neo/cypher/core.py", line 306, in commit
return self.post(self.__commit or self.__begin_commit)
File "/usr/local/lib/python2.7/dist-packages/py2neo/cypher/core.py", line 261, in post
raise self.error_class.hydrate(error)
py2neo.cypher.error.statement.ExecutionFailure: The statement has been closed.
I have searched a lot for a solution without success. Please help me with this.
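I can't comment on the algorithm or approach you use to rank the authors. Ultimately, though, the query you're running is a full graph search with some aggregation on top: it asks for all shortest paths between every pair of nodes, and with 1 million nodes that is roughly 5 × 10^11 pairs. Neo4j was not designed for such cases, and the query will only get harder to run as your data grows.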
Ideally, a query should only traverse a small section of the graph. So in your case, instead of asking who is the most popular in a single query, you could ask for the rank of one author per query. Running that for all of them, one at a time, and ranking the results yourself might work better here. Alternatively, you could take a different approach, such as limiting the range of neighbour nodes to traverse, the length of the longest path, or both. I suspect that would affect your result, though.
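To make the "one at a time, with a bounded path length" idea concrete, here is a minimal sketch in Python. It only builds the query text and merges partial counts; it does not talk to Neo4j. The helper names `build_bounded_query` and `merge_partial_counts` are hypothetical, and `max_hops` is an assumed cutoff that turns the result into an approximation. The `$source_id` parameter syntax is the one used by newer Neo4j versions; older servers expect `{source_id}` instead. The hop bound is interpolated into the string because Cypher does not accept a parameter in that position.

```python
from collections import Counter


def build_bounded_query(max_hops, label="DOLPHIN"):
    """Build a per-source variant of the original query with a hop limit.

    Hypothetical helper: the caller runs it once per source node, passing
    the node id as the `source_id` parameter.
    """
    # The variable-length bound cannot be a Cypher parameter, so validate
    # it before placing it in the query text.
    if isinstance(max_hops, bool) or not isinstance(max_hops, int) or max_hops < 2:
        raise ValueError("max_hops must be an integer >= 2")
    if not label.replace("_", "").isalnum():
        raise ValueError("label must be alphanumeric")
    return (
        "MATCH p = allShortestPaths("
        "(source:{label})-[*..{hops}]-(target:{label})) "
        "WHERE id(source) = $source_id "
        "AND id(source) < id(target) AND length(p) > 1 "
        "UNWIND nodes(p)[1..-1] AS n "
        "RETURN n.name AS name, count(*) AS partial"
    ).format(label=label, hops=max_hops)


def merge_partial_counts(batches):
    """Sum (name, partial) rows from many per-source runs into one tally."""
    totals = Counter()
    for rows in batches:
        for name, partial in rows:
            totals[name] += partial
    return totals


# Usage with made-up rows standing in for two per-source result sets.
query = build_bounded_query(4)
totals = merge_partial_counts([[("a", 2), ("b", 1)], [("a", 3)]])
ranking = totals.most_common()  # highest count first
```

Because each run is restricted to one source and keeps the `id(source) < id(target)` filter, every pair is still counted once across all runs, so summing the partial counts matches the single big query for paths within the hop limit. Each run is also a small transaction, which should be less likely to hit the timeout or the closed-statement error.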
I would advise you to revisit your domain model in light of your needs, and work out a design that makes it easy to answer questions such as who the most popular author is under your calculation approach. Also double-check that you're using indexes, just in case.
Modelling with neo4j:
Sometimes the simplest model doesn't help us answer certain questions. I've had to remodel a few times myself, turning relationships into nodes to sort temporal data, because the need wasn't obvious the first time around. Anyway, I hope you figure out a solution.
Cheers