Search code examples
neo4jcypherneo4j-apoc

Handling empty array types when using APOC to import a CSV to neo4j


I have a csv file wherein some fields are array-types. Fields are separated with , and array items are separated with ;. For example:

index, name, friends, neighbors
0,Jim,John;Tim;Fred,Susan;Megan;Cheryl
1,Susan,Jim;John,Megan;Cheryl
2,Sean,,,

where Jim has three friends, John, Tim, and Fred, and three neighbors, Susan, Megan, and Cheryl, and Sean has no friends and no neighbors.

However, when I read this into neo4j using apoc.load.csv, I end up with list properties with empty strings inside of them (rather than empty lists). For example:

CALL apoc.periodic.iterate("
CALL apoc.load.csv('file.csv',
    {header:true,sep:',',
    mapping:{
        friends:{array:true},
        neighbors:{array:true}}
    })
YIELD map as row RETURN row
","
CREATE (p:Person) SET p = row
", 
{batchsize:50000, iterateList:true, parallel:true});

Gives me a Person with name Sean but with friends=[ "" ] and neighbors=[ "" ].

What I want is Sean to have friends=[] and neighbors=[].

Thank you!


Solution

    1. Make sure there are no extraneous spaces in your CSV file header (or else some property names will start with a space):

      index,name,friends,neighbors
      0,Jim,John;Tim;Fred,Susan;Megan;Cheryl
      1,Susan,Jim;John,Megan;Cheryl
      2,Sean,,,
      
    2. Use list comprehension to help eliminate all friends and neighbors elements that are empty strings:

      CALL apoc.periodic.iterate(
        "CALL apoc.load.csv(
           'file.csv',
           {
             header:true, sep:',',
             mapping: {
               friends: {array: true},
               neighbors: {array: true}
             }
           }) YIELD map
         RETURN map
        ",
        "CREATE (p:Person)
         SET p = map
         SET p.friends = [f IN p.friends WHERE f <> '']
         SET p.neighbors = [n IN p.neighbors WHERE n <> '']
        ", 
        {batchsize:50000, iterateList:true, parallel:true}
      );
      

    With the above changes, this query:

    MATCH (person:Person) RETURN person;
    

    returns this result:

    ╒══════════════════════════════════════════════════════════════════════╕
    │"person"                                                              │
    ╞══════════════════════════════════════════════════════════════════════╡
    │{"name":"Jim","index":"0","neighbors":["Susan","Megan","Cheryl"],"frie│
    │nds":["John","Tim","Fred"]}                                           │
    ├──────────────────────────────────────────────────────────────────────┤
    │{"name":"Susan","index":"1","neighbors":["Megan","Cheryl"],"friends":[│
    │"Jim","John"]}                                                        │
    ├──────────────────────────────────────────────────────────────────────┤
    │{"name":"Sean","index":"2","neighbors":[],"friends":[]}               │
    └──────────────────────────────────────────────────────────────────────┘
    

    [UPDATED]

    Also, if it is not possible for your CSV file to contain an "empty" friend or neighbor substring (e.g., John;;Fred), then this version of the query that uses CASE instead of list comprehension would be more efficient:

    CALL apoc.periodic.iterate(
      "CALL apoc.load.csv(
         'file.csv',
         {
           header:true, sep:',',
           mapping: {
             friends: {array: true},
             neighbors: {array: true, arraySep:';'}
           }
         }) YIELD map
       RETURN map
      ",
      "CREATE (p:Person)
         SET p = map
         SET p.friends = CASE p.friends WHEN [''] THEN [] ELSE p.friends END
         SET p.neighbors = CASE p.neighbors WHEN [''] THEN [] ELSE p.neighbors END
      ", 
      {batchsize:50000, iterateList:true, parallel:true}
    );