Search code examples
pythondatabaseneo4jcypherpy2neo

Database creation takes to long in neo4j


Updated question:

I'm new in neo4j and im trying to create a movie rating database. I have a file with 100.000 registers (2,3 MB) that represents when a user rated a movie; the file look like this (this dataset is from MovieLens):

dataset file

I'm using py2neo and I create de datebase with this code:

data = pd.read_csv('ratings_small.csv')
def create_bipartite_graph(data):
#Indexes creation to perform the queries
    graph.run('''
        CREATE INDEX user_index IF NOT EXISTS FOR (u:User) ON (u.UserId)
    ''')
    graph.run('''
        CREATE INDEX movie_index IF NOT EXISTS FOR (m:Movie) ON (m.MovieId)
    ''')
    actual_user_node = None
    for index, row in data.iterrows():
        userID = int(row['userId'])
        movieID = int(row['movieId'])
        rating = row['rating']
        date = datetime.fromtimestamp(row['timestamp']).strftime("%m/%d/%Y, %H:%M:%S")

        #Creation nodes and relationships
        graph.run('''
            MERGE(u:User{UserId: $uID})
            MERGE(m:Movie{MovieId: $mID})
            CREATE (u)-[:RATED_MOVIE{rating: $r, date: $d}]->(m)
        ''', parameters = {'uID': userID, 'mID': movieID, 'r': rating, 'd': date})

The problem is with that small dataset, takes more than 2 hours in create the graph. Any advice for the time of the databse creation decrease considerably?

Using LOAD CSV:

This is the query that I execute in neo4j browser

EXPLAIN LOAD CSV With HEADERS FROM 'file:///ratings_small.csv' AS line FIELDTERMINATOR ',' 
MERGE(m:Movie{MovieId: toInteger(line.movieId)})
MERGE(u:User{UserId: toInteger(line.userId)})
CREATE (u)-[:RATED_MOVIE{rating:toFloat(line.rating)}]->(m)

And this is the profile plan: profile plan


Solution

  • Every graph run call represents not only a full round trip to the server, but a completely separate transaction. Working like this, with one such call inside every cycle of the loop, is therefore incredibly inefficient.

    I suggest instead taking a look at the bulk data operations API: https://py2neo.readthedocs.io/en/stable/bulk/index.html