Tags: neo4j, big-data, large-data

Neo4j Very Large Admin Import with Limited RAM


I am importing several TB of CSV data into Neo4j for a project I have been working on. I have enough fast storage for the estimated 6.6TiB, but the machine has only 32GB of memory, and the import tool suggests 203GB to complete the import.

When I run the import, I see the following (I assume it exited because it ran out of memory). Is there any way to import this large dataset with the limited memory I have? Or, failing that, with the ~128GB maximum that this machine's motherboard can support?

Available resources:
  Total machine memory: 30.73GiB
  Free machine memory: 14.92GiB
  Max heap memory : 6.828GiB
  Processors: 16
  Configured max memory: 21.51GiB
  High-IO: true

WARNING: estimated number of nodes 37583174424 may exceed capacity 34359738367 of selected record format
WARNING: 14.62GiB memory may not be sufficient to complete this import. Suggested memory distribution is:
heap size: 5.026GiB
minimum free and available memory excluding heap size: 202.6GiB
Import starting 2022-10-08 19:01:43.942+0000
  Estimated number of nodes: 15.14 G
  Estimated number of node properties: 97.72 G
  Estimated number of relationships: 37.58 G
  Estimated number of relationship properties: 0.00 
  Estimated disk space usage: 6.598TiB
  Estimated required memory usage: 202.6GiB

(1/4) Node import 2022-10-08 19:01:43.953+0000
  Estimated number of nodes: 15.14 G
  Estimated disk space usage: 5.436TiB
  Estimated required memory usage: 202.6GiB
.......... .......... .......... .......... ..........   5% ∆1h 38m 2s 867ms
neo4j@79d2b0538617:~/import$

Solution

  • TL;DR: Use periodic commit or transaction batching.

    If you're trying to follow the Operations Manual: Neo4j Admin Import, and your CSV matches the movies.csv in that example, I would suggest instead doing a more manual USING PERIODIC COMMIT LOAD CSV...:

    1. Stop the db.
    2. Put your CSV at neo4j/import/myfile.csv.
      • If you're using Desktop: Project > DB > click the ... on the right > Open Folder
    3. Add the APOC plugin.
    4. Start the DB.

    Next, open a browser instance, run the following (adjust for your data), and leave it until tomorrow:

    USING PERIODIC COMMIT LOAD CSV FROM 'file:///myfile.csv' AS line
    WITH SPLIT(line[3], ';') AS nodeLabels, {
      id: line[0],
      title: line[1],
      year: toInteger(line[2])
    } AS nodeProps
    // create each node with the labels from column 3 and the mapped properties
    CALL apoc.create.node(nodeLabels, nodeProps) YIELD node
    RETURN count(node)
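
    Note that USING PERIODIC COMMIT was deprecated in Neo4j 4.4 and removed in 5.x in favor of CALL { ... } IN TRANSACTIONS, which is the "transaction batching" mentioned above. A minimal sketch of the same import in that style (it must run in an implicit transaction, so prefix it with :auto in Browser):

    LOAD CSV FROM 'file:///myfile.csv' AS line
    CALL {
      WITH line
      // same labels/properties mapping as above
      CALL apoc.create.node(SPLIT(line[3], ';'), {
        id: line[0],
        title: line[1],
        year: toInteger(line[2])
      }) YIELD node
      RETURN node
    } IN TRANSACTIONS OF 10000 ROWS
    RETURN count(*)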
    

    Note: there are many ways to solve this problem, depending on your source data and the model you wish to create. This solution is only meant to give you a handful of tools to help you get around the memory limit. If it is a simple CSV with headers, and you don't care what labels the nodes get initially, you can skip the APOC call and just do something like the following:

    USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///myfile.csv' AS line
    CREATE (a :ImportedNode)
    // copies every CSV column onto the node; all values arrive as strings
    SET a = line
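
    Because LOAD CSV delivers every field as a string, numeric columns will need coercing afterwards. A sketch, assuming a hypothetical year column, and batching the update so it doesn't become one giant transaction (again, run with :auto in Browser):

    MATCH (a :ImportedNode)
    WHERE a.year IS NOT NULL
    CALL {
      WITH a
      SET a.year = toInteger(a.year)
    } IN TRANSACTIONS OF 50000 ROWS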
    

    File for Each Label

    The original asker mentioned having a separate CSV for each label. In such cases it may be helpful to have one big command that handles all of them, rather than manually stepping through each file.

    Assuming two label types, each with a unique 'id' property, and one with a 'parent_id' referencing the other label...

    UNWIND [
      { file: 'country.csv', label: 'Country'},
      { file: 'city.csv', label: 'City'}
    ] AS importFile
    // WITH HEADERS is needed so line.id works below; note that
    // USING PERIODIC COMMIT cannot be used here, because it requires
    // LOAD CSV to be the first clause of the query
    LOAD CSV WITH HEADERS FROM 'file:///' + importFile.file AS line
    CALL apoc.merge.node([importFile.label], {id: line.id}) YIELD node
    SET node = line
    ;
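
    // MATCH/MERGE by id will be slow without indexes, so create them first,
    // ideally before the node import above (index names here are arbitrary)
    CREATE INDEX country_id IF NOT EXISTS FOR (n :Country) ON (n.id);
    CREATE INDEX city_id IF NOT EXISTS FOR (n :City) ON (n.id);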
    
    // then build the relationships
    MATCH (city :City)
    WHERE city.parent_id IS NOT NULL
    MATCH (country :Country {id: city.parent_id})
    MERGE (city)-[:IN]->(country)
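
    At this scale, that last MERGE would otherwise run as one enormous transaction. Since APOC is already installed (step 3 above), one way to batch it is apoc.periodic.iterate; a sketch, with a batch size you'll want to tune, and parallel off to avoid lock contention on the country nodes:

    CALL apoc.periodic.iterate(
      "MATCH (city :City) WHERE city.parent_id IS NOT NULL RETURN city",
      "MATCH (country :Country {id: city.parent_id})
       MERGE (city)-[:IN]->(country)",
      {batchSize: 10000, parallel: false}
    );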