Creating relationships between nodes is very slow in py2neo

I have more than 50,000 nodes labeled Weblogs, and I am trying to connect pairs of nodes based on their properties. My code looks like this:

#!/usr/bin/env python

from py2neo import neo4j, Node, Relationship, Graph, GraphError
from py2neo.packages.httpstream import http

http.socket_timeout = 99999
graph = Graph()
relation_counter = 0
for node in graph.find("Weblogs"):
    # Existing hasDirectLinks relationships that start at this weblog
    matches = graph.match(start_node=node, rel_type="hasDirectLinks")
    if not matches:
        continue

    for relation in matches:
        # Look up weblogs whose entry_url equals the linked node's url
        for weblog_node in graph.find("Weblogs", "entry_url", relation.end_node.properties["url"]):
            if weblog_node:
                graph.create_unique(Relationship(node, "hasDirectLinks", weblog_node))
                relation_counter += 1
                if relation_counter % 30 == 0:
                    print(relation_counter, ": Numbers of Relationship made")

print(relation_counter, ": Total numbers of relationship made")

The code works, but it is very slow. Are there any suggestions for making it faster?

Solution

Wow dude, you're working really hard here to do something that (I think) is fairly simple. :) Yeah, I think you can improve on this!

It seems like you're trying to match certain kinds of weblog patterns, then create new direct relationships between indirectly related weblogs. Is that right?

I've tried to reformulate your code as a single Cypher query. Py2neo already lets you execute Cypher directly, so I would carefully double- and triple-check this query, and then run something similar to it. It would replace all of the code you pasted.

MATCH (blog:Weblogs)-[matches:hasDirectLinks]->(somethingElse)
WITH matches, blog, somethingElse
MATCH (weblog_node:Weblogs)
WHERE weblog_node.entry_url = somethingElse.url
MERGE (blog)-[newRel:hasDirectLinks]->(weblog_node)
RETURN count(newRel);

(I gave the variables the same names your Python uses, so hopefully this is easier to follow.)
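
For what it's worth, here's a minimal sketch of sending that query from Python. It assumes py2neo 2.x (which your imports suggest), where `graph.cypher.execute_one` returns the first value of the first row; other py2neo versions expose Cypher differently, so check the docs for yours. The `create_direct_links` helper name is made up for illustration and is not part of py2neo.

```python
# Sketch only: assumes py2neo 2.x; create_direct_links is a hypothetical helper.
QUERY = """
MATCH (blog:Weblogs)-[matches:hasDirectLinks]->(somethingElse)
WITH matches, blog, somethingElse
MATCH (weblog_node:Weblogs)
WHERE weblog_node.entry_url = somethingElse.url
MERGE (blog)-[newRel:hasDirectLinks]->(weblog_node)
RETURN count(newRel)
"""


def create_direct_links(graph):
    """Send the whole job to the server as one Cypher statement.

    `graph` is expected to be a py2neo 2.x Graph. The returned number counts
    relationships that MERGE either matched or created.
    """
    return graph.cypher.execute_one(QUERY)


# Usage (needs py2neo 2.x installed and a running Neo4j server):
#   from py2neo import Graph
#   print(create_direct_links(Graph()))
```

Because MERGE only creates a relationship when it doesn't already exist, rerunning this should not produce duplicates, which is roughly what `create_unique` was doing for you in the loop.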

Your code is running really slowly because you're connecting to a REST endpoint and doing a lot of individual fetches of relationships and individual scans of nodes with certain labels. This means your code spends a lot of time going back and forth to the server. If you use Cypher instead of manually programming which relationships get created, you can handle all of the nodes and all of the relationships in a single query: once to the server and back, and you're done.
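
To put rough numbers on that, here's a back-of-the-envelope sketch. It assumes every py2neo call in your loop (`graph.find`, `graph.match`, `graph.create_unique`) costs one HTTP round trip. The function name and the averages plugged in are made-up illustrative values, not measurements from your data, and the real count is likely higher since fetching node properties can cost extra requests.

```python
def estimate_round_trips(n_nodes, avg_links, avg_hits):
    """Rough count of HTTP requests made by the looping approach.

    Assumption: one request per py2neo call. The parameter names and the
    sample values used below are hypothetical.
    """
    initial_find = 1                               # graph.find("Weblogs")
    match_calls = n_nodes                          # one graph.match per node
    find_calls = n_nodes * avg_links               # one graph.find per relationship
    create_calls = n_nodes * avg_links * avg_hits  # one create_unique per hit
    return initial_find + match_calls + find_calls + create_calls


# 50,000 nodes comes from the question; 2 links and 1 hit are guesses.
looped = estimate_round_trips(n_nodes=50000, avg_links=2, avg_hits=1)
single_query = 1  # the Cypher version is a single request
print(looped, "requests vs", single_query)
```

Even with small guesses for the averages, the loop ends up making hundreds of thousands of requests, and each one pays network and HTTP overhead.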

I'm betting that running this as a single Cypher query is going to be many dozens of times faster.

Cypher is your friend! If you learn it, you're going to save yourself a lot of coding!

Happy trails!