Search code examples
csvneo4jcyphergraph-databasesload-csv

Neo4j LOAD CSV process arrays in CSV documents using a separator


I'm using LOAD CSV to import data from csv to neo4j. My dataset contains multiple values in the country field. Currently I'm using a semicolon as the separator of those multiple values.

nodes-person.csv

id,country
http://author,country1;country2;country3

And this is the cypher query which I use to import data into neo4j

LOAD CSV WITH HEADERS FROM "file:///nodes-person.csv" AS csvLine
  MERGE (p:`person` {id: csvLine.id})
    ON CREATE
    SET
    p.country = split(csvLine.country,";")

    ON MATCH
    SET
    p.country = split(csvLine.country,";")

RETURN p;

My question is, how can I split the values properly if the values contain the separator character.

ie:

country\\;1 ; country\\;2 ; country\\;3

Solution

  • You've got a couple of options - one is pure Cypher and slightly untidy, the other is using APOC and regular expressions. I'm making the assumption that if the semicolon appears within a country name it's escaped with a single backslash.

    Cypher route

    The plan here is to do three replacements:

    • Replace instances of the escaped semicolon with some string that's very unlikely to appear as a country name (say, __SEMICOLON__)
    • Split the string by semicolons as you're doing so far
    • Within each string resulting from the split, replace __SEMICOLON__ instances with a semicolon character

    Something like the following would work (the WITH is just so it's runnable in isolation):

    WITH 'country1\\;;country2;country3\\;' as countries
    RETURN [x in split(replace(countries, '\\;', '__SEMICOLON__'), ';') | replace(x, '__SEMICOLON__', ';')]
    

    enter image description here

    APOC and Regular Expressions

    A tidier approach is to use apoc.text.replace and supply a regular expression as the 'separator', where we want to split the string by semicolons that are not preceded by the backslash escape character:

    WITH 'country1\\;;country2;country3\\;' as countries
    RETURN [x in apoc.text.split(countries, '(?<!\\\\);') | replace(x, '\\;', ';')]
    

    enter image description here

    We do a final tidy-up to replace the escaped semicolons with plain semicolons for storage with that list comprehension. The regex is shamelessly stolen from this answer.