Search code examples
csvneo4jcypherload

Loading CSV in Neo4j is time consuming


I want load a CDR csv file with 648000 records to neo4j (4.4.10), But it is about 4 days and And it is not yet completed.

My CSV have 648000 records with 7 columns. and the size of file is about 48 MB. My computer have 100 GB RAM and intel Zeon E5 CPU.

the columns of CSV are:

OP_Name TP_Name Called_Number OP_ANI Setup_Time Duration OP_Price

the code that I use to load CSV in Neo4j is:

```Cypher
:auto load csv with headers from 'file:///cdr.csv' as line FIELDTERMINATOR  ','
    with line
    where line['Called_Number'] is not null and line['OP_ANI'] is not null
    with line['OP_ANI'] as OP_Phone,
        (CASE line['OP_Name']
            WHEN 'TIC' THEN 'IRAN'
            ELSE 'Foreign' END) AS OP_country,
        line['Called_Number'] as Called_Phone,
        (CASE line['TP_Name']
            WHEN 'TIC' THEN 'IRAN'
            ELSE 'Foreign' END) AS TP_country,
        line['Setup_Time'] as Setup_Time, 
        line['Duration'] as Duration, 
        line['OP_Price'] as OP_Price
    
    call {
        with  OP_Phone, OP_country, Called_Phone, TP_country, Setup_Time, Duration, OP_Price
        
MERGE (c:Customer{phone: toInteger(Called_Phone)})
            on create set c.country = TP_country
            WITH c, OP_Phone, OP_country, Called_Phone, TP_country, Setup_Time, Duration, OP_Price
            CALL apoc.create.addLabels( c, [ c.country ] ) YIELD node
        
MERGE (c2:Customer{phone: toInteger(OP_Phone)})
            on create set c2.country = OP_country
            WITH c2, OP_Phone, OP_country, Called_Phone, TP_country, Setup_Time, Duration, OP_Price, c
            CALL apoc.create.addLabels( c2, [ c2.country ] )  YIELD node
        
        MERGE (c2)-[r:CALLED{setupTime: Setup_Time, 
                    duration: Duration,
                    OP_Price: OP_Price}]->(c)
       
    } IN TRANSACTIONS

```

How can I speed up the load operation?


Solution

  • MERGE acts as an upsert in Neo4j. So the statement:

    MERGE (c:Customer{phone: toInteger(Called_Phone)})
    

    checks if there is a Customer node with the given phone number is there or not. If it is, it performs the update otherwise creates the node. When there is a large number of nodes, this lookup can be very slow, and CSV import will be slow overall. Creating an index on the phone property of Customer should do the trick. You can create the index like this:

    CREATE INDEX phone IF NOT EXISTS FOR (n:Customer) ON (n.phone)