neo4j tree database-performance

Creating and managing trees with millions of vertices in Neo4j 3.0.4


I'm working on a university project and I've been asked to create a system that builds complete trees with millions of nodes (at least 1 or 2 million). I tried creating the tree with LOAD CSV using periodic commit, and it worked well when creating only the nodes (70000 ms on a general-purpose notebook :P). When I tried the same with the edges, it didn't scale as well.

USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///Archi.csv' AS line
MERGE (:Vertex {name: line.from})<-[:EDGE {attr1: toFloat(line.attr1), attr2: toFloat(line.attr2), attr3: toFloat(line.attr3), attr4: toFloat(line.attr4), attr5: toFloat(line.attr5)}]-(:Vertex {name: line.to})

I need to guarantee that a tree is generated in no more than 5 minutes.

Is there a faster method that can deliver that kind of performance?

P.S.: The task doesn't require Neo4j, just a database (either SQL or NoSQL), but I came across this NoSQL graph DB and thought it would be nice to implement this with Neo4j, since the graph data structure comes for free.

P.P.S.: I'm using Cypher.


Solution

  • I think you should read up on MERGE in the developer documentation again, to make sure you understand exactly what it's doing.

    A few things in particular to be aware of...

    If the pattern you are merging does not exist as a whole, every element of the pattern will be created, which could result in duplicate :Vertex nodes. If your :Vertex nodes are supposed to be in the database already, there are no relationships yet, and you are sure that no relationship is repeated in your CSV, I strongly urge you to MATCH on the start and end nodes and then CREATE the relationship between them instead of using MERGE.

    Remember that a MERGE on a relationship with many attributes will first try to match on all of them, so as the number of relationships between nodes grows, there will be an increasing number of comparisons, which will slow your query down further. CREATE is the better choice if you know that no relationship will be duplicated and you are sure those relationships don't exist yet.

    I also urge you to create an index on :Vertex(name), as that will significantly speed up matching the start and end nodes.
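
The advice above could be sketched in Cypher roughly as follows. This is a minimal sketch, not a tuned solution. It assumes that every :Vertex node has already been loaded, that no :EDGE relationships exist yet, and that Archi.csv lists each edge only once. The batch size of 10000 and the variable names fromNode and toNode are illustrative choices that do not appear in the question. The index statement uses the Neo4j 3.x syntax and is meant to be run on its own before the load.

```cypher
// Statement 1 (run once, separately): index the property used to look up
// both endpoints. Neo4j 3.x syntax.
CREATE INDEX ON :Vertex(name);

// Statement 2: load the edges.
// Assumes all :Vertex nodes exist and each edge appears once in the CSV.
// 10000 is an assumed batch size, not a value from the question.
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///Archi.csv' AS line
// Look up the two existing endpoints (index-backed once the index is online).
MATCH (fromNode:Vertex {name: line.from})
MATCH (toNode:Vertex {name: line.to})
// Same direction as the original MERGE: the edge points from "to" to "from".
CREATE (fromNode)<-[:EDGE {
  attr1: toFloat(line.attr1),
  attr2: toFloat(line.attr2),
  attr3: toFloat(line.attr3),
  attr4: toFloat(line.attr4),
  attr5: toFloat(line.attr5)
}]-(toNode);
```

Because CREATE never checks for an existing relationship, running the load a second time would duplicate every edge. Rows whose endpoints are not found by the two MATCH clauses produce no relationship at all, so a name missing from the node set is silently skipped.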