I am importing several TB of CSV data into Neo4j for a project I have been working on. I have enough fast storage for the estimated 6.6TiB, but the machine has only 32GB of memory, and the import tool estimates it needs 203GB of memory to complete the import.
When I run the import, I see the following (I assume it exited because it ran out of memory). Is there any way to import this large dataset with the limited amount of memory I have? Or, if not, with the ~128GB maximum that this machine's motherboard can support?
Available resources:
Total machine memory: 30.73GiB
Free machine memory: 14.92GiB
Max heap memory : 6.828GiB
Processors: 16
Configured max memory: 21.51GiB
High-IO: true
WARNING: estimated number of nodes 37583174424 may exceed capacity 34359738367 of selected record format
WARNING: 14.62GiB memory may not be sufficient to complete this import. Suggested memory distribution is:
heap size: 5.026GiB
minimum free and available memory excluding heap size: 202.6GiB
Import starting 2022-10-08 19:01:43.942+0000
Estimated number of nodes: 15.14 G
Estimated number of node properties: 97.72 G
Estimated number of relationships: 37.58 G
Estimated number of relationship properties: 0.00
Estimated disk space usage: 6.598TiB
Estimated required memory usage: 202.6GiB
(1/4) Node import 2022-10-08 19:01:43.953+0000
Estimated number of nodes: 15.14 G
Estimated disk space usage: 5.436TiB
Estimated required memory usage: 202.6GiB
.......... .......... .......... .......... .......... 5% ∆1h 38m 2s 867ms
neo4j@79d2b0538617:~/import$
TL;DR: Use Periodic Commit (transaction batching).
If you're trying to follow the Operations Manual: Neo4j Admin Import, and your CSV matches the movies.csv in that example, I would suggest instead doing a more manual USING PERIODIC COMMIT LOAD CSV. First, move your file into the import directory:
neo4j/import/myfile.csv
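For example, with the Docker deployment implied by the prompt in the log above (the container id and the official image's default import path are assumptions; adjust for your setup):
# copy the CSV from the host into the container's import directory
docker cp myfile.csv 79d2b0538617:/var/lib/neo4j/import/myfile.csv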
Next, open a Neo4j Browser instance, run the following (adjust for your data), and leave it until tomorrow:
USING PERIODIC COMMIT LOAD CSV FROM 'file:///myfile.csv' AS line
WITH SPLIT(line[3], ';') AS nodeLabels, {
  id: line[0],
  title: line[1],
  year: toInteger(line[2])
} AS nodeProps
CALL apoc.create.node(nodeLabels, nodeProps) YIELD node
RETURN count(node)
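Note: USING PERIODIC COMMIT is deprecated in Neo4j 4.4 and removed in 5.x, where the same batching is done with CALL { ... } IN TRANSACTIONS. A minimal sketch of the same import in that style (run with the :auto prefix in Browser; the 10000-row batch size is an assumption to tune):
:auto LOAD CSV FROM 'file:///myfile.csv' AS line
CALL {
  WITH line
  // same dynamic-label node creation as above, committed in batches
  CALL apoc.create.node(SPLIT(line[3], ';'), {
    id: line[0],
    title: line[1],
    year: toInteger(line[2])
  }) YIELD node
  RETURN count(node) AS created
} IN TRANSACTIONS OF 10000 ROWS
RETURN sum(created);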
Note: There are many ways to solve this problem, depending on your source data and the model you wish to create. This solution is only meant to give you a handful of tools to help you get around the memory limit. If it is a simple CSV, you don't care what labels the nodes get initially, and you have headers, you can skip the complex APOC call and just do something like the following:
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///myfile.csv' AS line
CREATE (a :ImportedNode)
SET a = line
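One tip either way: if a later pass will MERGE on a property (as the per-label example below does with id), create an index on that property first, otherwise every MERGE becomes a full label scan. A sketch in Neo4j 4.x syntax, reusing the :ImportedNode label from the example above (repeat per label you merge on, e.g. :Country and :City below):
CREATE INDEX imported_id IF NOT EXISTS FOR (n:ImportedNode) ON (n.id);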
The original asker mentioned having a separate CSV for each label. In such cases it may be helpful to have one big command that handles all of them, rather than manually stepping through each part of the operation.
Assuming two labels, each with a unique 'id' property, and one with a 'parent_id' referencing the other...
UNWIND [
  { file: 'country.csv', label: 'Country'},
  { file: 'city.csv', label: 'City'}
] AS importFile
// USING PERIODIC COMMIT can't follow UNWIND, so batch via APOC instead
CALL apoc.periodic.iterate(
  'LOAD CSV WITH HEADERS FROM "file:///" + $file AS line RETURN line',
  'CALL apoc.merge.node([$label], {id: line.id}) YIELD node SET node = line',
  {batchSize: 10000, params: {file: importFile.file, label: importFile.label}}
) YIELD batches, total
RETURN importFile.file, batches, total
;
// then build the relationships
MATCH (city:City)
WHERE city.parent_id IS NOT NULL
MATCH (country:Country {id: city.parent_id})
MERGE (city)-[:IN]->(country)
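With billions of nodes, that last MERGE pass in a single transaction will hit the same memory wall, so batch it the same way. A sketch using apoc.periodic.iterate (the batch size is a guess to tune):
CALL apoc.periodic.iterate(
  'MATCH (city:City) WHERE city.parent_id IS NOT NULL RETURN city',
  'MATCH (country:Country {id: city.parent_id}) MERGE (city)-[:IN]->(country)',
  {batchSize: 10000}
);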