csv neo4j cypher graph-databases load-csv

Neo4j LOAD CSV process arrays in CSV documents using a separator

I'm using LOAD CSV to import data from csv to neo4j. My dataset contains multiple values in the country field. Currently I'm using a semicolon as the separator of those multiple values.

nodes-person.csv

id,country
http://author,country1;country2;country3

And this is the cypher query which I use to import data into neo4j

LOAD CSV WITH HEADERS FROM "file:///nodes-person.csv" AS csvLine
  MERGE (p:`person` {id: csvLine.id})
    ON CREATE
    SET
    p.country = split(csvLine.country,";")

    ON MATCH
    SET
    p.country = split(csvLine.country,";")

RETURN p;

My question is, how can I split the values properly if the values contain the separator character.

ie:

country\\;1 ; country\\;2 ; country\\;3

Solution

You've got a couple of options - one is pure Cypher and slightly untidy, the other is using APOC and regular expressions. I'm making the assumption that if the semicolon appears within a country name it's escaped with a single backslash.

Cypher route

The plan here is to do three replacements:

Replace instances of the escaped semicolon with some string that's very unlikely to appear as a country name (say, __SEMICOLON__)
Split the string by semicolons as you're doing so far
Within each string resulting from the split, replace __SEMICOLON__ instances with a semicolon character

Something like the following would work (the WITH is just so it's runnable in isolation):

WITH 'country1\\;;country2;country3\\;' as countries
RETURN [x in split(replace(countries, '\\;', '__SEMICOLON__'), ';') | replace(x, '__SEMICOLON__', ';')]

APOC and Regular Expressions

A tidier approach is to use apoc.text.replace and supply a regular expression as the 'separator', where we want to split the string by semicolons that are not preceded by the backslash escape character:

WITH 'country1\\;;country2;country3\\;' as countries
RETURN [x in apoc.text.split(countries, '(?<!\\\\);') | replace(x, '\\;', ';')]

We do a final tidy-up to replace the escaped semicolons with plain semicolons for storage with that list comprehension. The regex is shamelessly stolen from this answer.