I have definitions (in code/files) for ~400k nodes and ~500k relationships. I need to load this data into a Neo4j database. To do this, I programmatically build a query to create each node with a unique id (UUID) field in its properties:
CREATE (:MyNode {UUID: "aaaa"});
CREATE (:MyNode {UUID: "bbbb"});
...
After creating all the nodes in the DB, I create the relationships with queries that look like:
MATCH (a {UUID: "aaaa"}), (b {UUID: "bbbb"})
CREATE (a)-[:MyRelation]->(b);
MATCH (c {UUID: "cccc"}), (d {UUID: "dddd"})
CREATE (c)-[:MyRelation]->(d);
...
So running this requires ~500k MATCH statements that each run over ~400k nodes. Is there a better approach, or a way to optimize this?
NOTE: Unifying node and relationship creation in the same query, to take advantage of variables, is not practical; the query would be too large.
My raw data is formatted like this (in case you want to test):
{
  "nodes": [
    { "UUID": "0001" },
    { "UUID": "0002" },
    { "UUID": "0003" },
    { "UUID": "0004" },
    { "UUID": "0005" }
  ],
  "relationships": [
    { "uuid_A": "0001", "uuid_B": "0002", "label": "related_to" },
    { "uuid_A": "0001", "uuid_B": "0003", "label": "related_to" },
    { "uuid_A": "0002", "uuid_B": "0003", "label": "related_to" },
    { "uuid_A": "0002", "uuid_B": "0004", "label": "related_to" },
    { "uuid_A": "0003", "uuid_B": "0004", "label": "related_to" },
    { "uuid_A": "0003", "uuid_B": "0005", "label": "related_to" },
    { "uuid_A": "0004", "uuid_B": "0005", "label": "related_to" },
    { "uuid_A": "0004", "uuid_B": "0001", "label": "related_to" },
    { "uuid_A": "0005", "uuid_B": "0001", "label": "related_to" },
    { "uuid_A": "0005", "uuid_B": "0002", "label": "related_to" }
  ]
}
Here is a detailed recommendation.
Create 2 CSV input files (since the data will be processed in separate queries, and to make the files more compact):
nodes.csv, for the MyNode properties (which in your case is just a uuid). For example:
uuid
0001
0002
0003
0004
0005
rels.csv, for the MyRelation data. For example:
from,to,type
0001,0002,related_to
0001,0003,related_to
0002,0003,related_to
0002,0004,related_to
0003,0004,related_to
0003,0005,related_to
0004,0005,related_to
0004,0001,related_to
0005,0001,related_to
0005,0002,related_to
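As an aside: if you'd rather skip the CSV conversion, and the APOC plugin is installed with local file imports enabled (apoc.import.file.enabled=true), apoc.load.json can read your raw JSON directly. A minimal sketch for the node portion, assuming the file is named raw.json and placed in Neo4j's import directory:

// Sketch only: assumes APOC is installed, file imports are enabled,
// and the raw JSON shown above is saved as raw.json in the import directory.
CALL apoc.load.json("file:///raw.json") YIELD value
UNWIND value.nodes AS n
CREATE (:MyNode {uuid: n.UUID})

At your data volumes you would still want to batch this, in the same IN TRANSACTIONS style shown below.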
Create a uniqueness constraint on :MyNode(uuid) to enforce that the UUIDs are indeed unique. This also automatically creates an index on :MyNode(uuid) for you. Having an index will be important for quickly finding nodes by UUID, which we will need to do when ingesting the relationships below.
CREATE CONSTRAINT MyNode_uuid FOR (m:MyNode) REQUIRE m.uuid IS UNIQUE
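If you want to verify that the constraint (and its backing index) exists before loading, you can list it (available in Neo4j 4.3 and later):

SHOW CONSTRAINTS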
Ingest the data in 2 queries using LOAD CSV.
First, ingest the nodes. I assume your DB has no MyNode nodes to start with, so we can use CREATE instead of the slower MERGE. Also, to avoid running out of memory and to improve performance, we commit after processing every 50K nodes (you can play with that number to see what works better for you). When using IN TRANSACTIONS in the Neo4j Browser, you need to prepend :auto to the front of the query (to see some docs on this, enter :help auto in the Browser).
LOAD CSV WITH HEADERS FROM "file:///nodes.csv" AS row
CALL {
WITH row
CREATE (:MyNode {uuid: row.uuid})
} IN TRANSACTIONS OF 50000 ROWS
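As a quick sanity check, you can confirm that the node count matches the number of data rows in nodes.csv (this query uses the count store, so it is fast):

MATCH (m:MyNode)
RETURN COUNT(m) AS node_count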
Then ingest the relationships. Again, you can tweak the 50K value. In order to use the index we created earlier, the MATCH clause must specify the indexed node label and property. We use the APOC procedure apoc.create.relationship to specify the relationship type dynamically.
LOAD CSV WITH HEADERS FROM "file:///rels.csv" AS row
CALL {
WITH row
MATCH (a:MyNode {uuid: row.from}), (b:MyNode {uuid: row.to})
CALL apoc.create.relationship(a, row.type, NULL, b) YIELD rel
// Need to RETURN to conform with Cypher syntax.
// But return aggregated value so subquery only returns 1 row, which
// will not affect the number of rows in the enclosing query.
RETURN COUNT(*) AS cnt
} IN TRANSACTIONS OF 50000 ROWS
RETURN SUM(cnt) AS num_rels_added
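If you do not have APOC installed and all your relationships use a single known type (as in your sample data, where every row's type is related_to), you can hardcode the relationship type and drop the APOC call entirely. A sketch under that assumption:

LOAD CSV WITH HEADERS FROM "file:///rels.csv" AS row
CALL {
  WITH row
  MATCH (a:MyNode {uuid: row.from}), (b:MyNode {uuid: row.to})
  // Relationship type is hardcoded here, so no APOC is needed.
  CREATE (a)-[:related_to]->(b)
} IN TRANSACTIONS OF 50000 ROWS

Standard Cypher's CREATE does not accept a variable or parameter as the relationship type, which is why the APOC procedure is needed in the dynamic case.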