neo4j, cypher, query-optimization

Estimating time to set relationships in Neo4j


In general, is there a best practice for estimating how long it takes to create relationships in Neo4j?

For example, I used the data import tool successfully, and here's what I've got in my 2.24GB database:

IMPORT DONE in 3m 8s 791ms. Imported: 7432663 nodes 0 relationships 119743432 properties

In preparation for setting relationships, I set some indices:

CREATE INDEX ON :ChessPlayer(player_id);
CREATE INDEX ON :Matches(player_id);

Then I let it rip:

MATCH (p:Player),(m:Matches)
WHERE p.player_id = m.player_id
CREATE (p)-[r:HAD_MATCH]->(m)

Then I realized that I have no idea how to estimate how long creating these relationships might take. Is there a 'back of the envelope' calculation for getting at least a ballpark figure for this kind of thing?

I understand that everyone's situation is different on all levels, including software, hardware, and desired schema. But any discussion would no doubt be useful and would deepen my understanding, and that of anyone else who reads this.

PS: FWIW, I'm running Ubuntu 14.04 with 16GB RAM and an Intel Core i7-3630QM CPU @ 2.40GHz


Solution

  • The problem here is that you don't take transaction sizes into account. In your example, all :HAD_MATCH relationships are created in one single large transaction. A transaction is first built up in memory and then flushed to disk. If the transaction is too large to fit in your heap, you might see massive performance degradation due to garbage collection, or even an OutOfMemoryError.

    Typically you want to limit each transaction to roughly 10k to 100k atomic operations.

    Probably the easiest way to batch transactions in this case is the rock_n_roll procedure from neo4j-apoc. It takes two Cypher statements: the first provides the data to work on, and the second runs once for each result of the first, in batches. The third argument (20000 here) is the batch size. Note that APOC requires Neo4j 3.x:

    CALL apoc.periodic.rock_n_roll(
       "MATCH (p:Player),(m:Matches) WHERE p.player_id = m.player_id RETURN p,m",
       "WITH {p} AS p, {m} AS m CREATE (p)-[:HAD_MATCH]->(m)",
       20000)
    

    A bug in 3.0.0 and 3.0.1 made this perform rather badly, so the above is for Neo4j >= 3.0.2.

    If you are on 3.0.0 or 3.0.1, use this as a workaround:

    CALL apoc.periodic.rock_n_roll(
       "MATCH (p:Player),(m:Matches) WHERE p.player_id = m.player_id RETURN p,m",
       "CYPHER planner=rule WITH {p} AS p, {m} AS m CREATE (p)-[:HAD_MATCH]->(m)",
       20000)
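
Once the work is split into fixed-size batches, a rough ballpark becomes possible: time a small sample of batches and extrapolate. The sketch below is only an illustration of that arithmetic, not part of APOC or Neo4j. The function name estimate_batched_run and all of its inputs are hypothetical, and it assumes throughput stays constant across batches, which will not hold exactly in practice (heap pressure, page cache, and index usage all affect it). The total number of relationships could come from running the same MATCH with RETURN count(*) instead of CREATE.

```python
import math


def estimate_batched_run(total_ops, batch_size, sample_ops, sample_seconds):
    """Extrapolate total run time from a timed sample of the batched job.

    Assumes constant throughput, so treat the result as a ballpark only.
    Returns (number_of_batches, estimated_seconds).
    """
    if min(total_ops, batch_size, sample_ops) <= 0 or sample_seconds <= 0:
        raise ValueError("all inputs must be positive")
    batches = math.ceil(total_ops / batch_size)    # one transaction per batch
    ops_per_second = sample_ops / sample_seconds   # throughput seen in the sample
    eta_seconds = total_ops / ops_per_second       # linear extrapolation
    return batches, eta_seconds


# Hypothetical inputs, NOT measurements from the question:
# 1,000,000 relationships to create, batches of 20,000,
# and one sample batch of 20,000 that took 4 seconds.
batches, eta = estimate_batched_run(
    total_ops=1_000_000,
    batch_size=20_000,
    sample_ops=20_000,
    sample_seconds=4.0,
)
print(batches, round(eta))  # prints "50 200"
```

With those made-up inputs the job would be 50 transactions and roughly 200 seconds. Timing two or three sample batches rather than one gives a better sense of whether throughput is actually stable.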