Tags: optimization, neo4j, cypher, graph-databases

Optimize Neo4j relationship creation


I have definitions (in code/files) for ~400k nodes and ~500k relationships. I need to load these into a Neo4j database. To do this, I programmatically build a query that creates each node with a unique id (UUID) field in its properties.

CREATE (:MyNode {UUID: "aaaa"});
CREATE (:MyNode {UUID: "bbbb"});
...

After creating all the nodes in the DB, I create the relationships with a query that looks like:

MATCH (a {UUID: "aaaa"}), (b {UUID: "bbbb"})
CREATE (a)-[:MyRelation]->(b);

MATCH (c {UUID: "cccc"}), (d {UUID: "dddd"})
CREATE (c)-[:MyRelation]->(d);

...

So running this requires ~500k MATCH statements over ~400k nodes. Is there a better approach to optimize this?

NOTE: Unifying node and relationship creation in the same query, to take advantage of variables, is not practical: the query is too large.

My raw data is formatted like this (in case you want to test):

{
  "nodes": [
    { "UUID": "0001" },
    { "UUID": "0002" },
    { "UUID": "0003" },
    { "UUID": "0004" },
    { "UUID": "0005" }
  ],
  "relationships": [
    { "uuid_A": "0001", "uuid_B": "0002", "label": "related_to" },
    { "uuid_A": "0001", "uuid_B": "0003", "label": "related_to" },
    { "uuid_A": "0002", "uuid_B": "0003", "label": "related_to" },
    { "uuid_A": "0002", "uuid_B": "0004", "label": "related_to" },
    { "uuid_A": "0003", "uuid_B": "0004", "label": "related_to" },
    { "uuid_A": "0003", "uuid_B": "0005", "label": "related_to" },
    { "uuid_A": "0004", "uuid_B": "0005", "label": "related_to" },
    { "uuid_A": "0004", "uuid_B": "0001", "label": "related_to" },
    { "uuid_A": "0005", "uuid_B": "0001", "label": "related_to" },
    { "uuid_A": "0005", "uuid_B": "0002", "label": "related_to" }
  ]
}


Solution

  • Here is a detailed recommendation.

    1. Create 2 CSV input files (since the data will be processed in separate queries, and to make the files more compact):

      • nodes.csv for the MyNode properties (which in your case is just a UUID). For example:

        uuid
        0001
        0002
        0003
        0004
        0005
        
      • rels.csv for the MyRelation data. For example:

        from,to,type
        0001,0002,related_to
        0001,0003,related_to
        0002,0003,related_to
        0002,0004,related_to
        0003,0004,related_to
        0003,0005,related_to
        0004,0005,related_to
        0004,0001,related_to
        0005,0001,related_to
        0005,0002,related_to
        
    2. Create a uniqueness constraint on :MyNode(uuid) to enforce that the UUIDs are indeed unique. This also automatically creates an index on :MyNode(uuid) for you. Having an index will be important for quickly finding nodes by UUID, which we will need to do when ingesting relationships below.

      CREATE CONSTRAINT MyNode_uuid FOR (m:MyNode) REQUIRE m.uuid IS UNIQUE
      
    3. Ingest the data in 2 queries using LOAD CSV.

      • First ingest the nodes. I assume your DB has no MyNode nodes to start with, so we can use CREATE instead of the slower MERGE. Also, to avoid running out of memory and to improve performance, we commit after every 50K nodes processed (you can experiment with that number to see what works best for you). When using IN TRANSACTIONS in the Neo4j Browser, you need to prepend :auto to the query (for the docs on this, enter :help auto in the Browser).

        LOAD CSV WITH HEADERS FROM "file:///nodes.csv" AS row
        CALL {
          WITH row
          CREATE (:MyNode {uuid: row.uuid})
        } IN TRANSACTIONS OF 50000 ROWS
        
      • Then ingest the relationships. Again, you can tweak the 50K value. In order to use the index we created earlier, the MATCH clause must specify the indexed node label and property. We use the APOC procedure apoc.create.relationship to set the relationship type dynamically from the data. (If every relationship has the same known type, you can skip APOC entirely; see the sketch after this list.)

        LOAD CSV WITH HEADERS FROM "file:///rels.csv" AS row
        CALL {
          WITH row
          MATCH (a:MyNode {uuid: row.from}), (b:MyNode {uuid: row.to})
          CALL apoc.create.relationship(a, row.type, NULL, b) YIELD rel
          // Need to RETURN to conform with Cypher syntax.
          // But return aggregated value so subquery only returns 1 row, which
          // will not affect the number of rows in the enclosing query.
          RETURN COUNT(*) AS cnt
        } IN TRANSACTIONS OF 50000 ROWS
        RETURN SUM(cnt) AS num_rels_added
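
  • As noted above, apoc.create.relationship is only needed because the relationship type comes from the data. If every row in rels.csv uses one known type (as in the sample data, where every relationship is related_to), you can drop APOC and hardcode the type. A minimal sketch under that assumption:

    LOAD CSV WITH HEADERS FROM "file:///rels.csv" AS row
    CALL {
      WITH row
      MATCH (a:MyNode {uuid: row.from}), (b:MyNode {uuid: row.to})
      // Hardcoded relationship type, so no APOC call (and no subquery RETURN) is needed
      CREATE (a)-[:related_to]->(b)
    } IN TRANSACTIONS OF 50000 ROWS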
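
  • Once both imports finish, a quick sanity check you can run (the expected counts are simply the row counts of your two CSV files):

    // Should equal the number of rows in nodes.csv
    MATCH (n:MyNode) RETURN count(n) AS node_count;

    // Should equal the number of rows in rels.csv, broken down by relationship type
    MATCH (:MyNode)-[r]->() RETURN type(r) AS rel_type, count(r) AS rel_count;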