Search code examples
csvmemgraphdb

Handling null values with LOAD CSV and the date() function in Memgraph


I've been tasked with importing a significant amount of data from various CSV files. So far, the process has been pretty smooth, but I've hit a snag with one particular CSV.

In this file, named mydata.csv, there's a column labeled endDate that contains date values.

id,customer,typeOfMobilePlan,endDate
8475932051,John Smith,Unlimited Data,2023-09-30
2938475610,Jane Doe,Basic Plan,2023-10-01
5674029183,Alice Johnson,Family Plan,2023-10-02
1248097536,Robert Brown,Prepaid Plan,
9385724160,Mary Davis,Unlimited Data,2023-10-04
2568931470,James Wilson,Family Plan,2023-10-05
5849021376,Patricia Taylor,Pay As You Go,
2048573912,Jennifer Lee,Basic Plan,2023-10-07
7385912046,Michael Harris,Unlimited Data,20231008
9032857461,Emily Clark,Family Plan,2023-10

My intention was to utilize these dates when creating nodes in Memgraph. My query looks something like this:

LOAD CSV FROM "file:///imports/mydata.csv" WITH HEADER AS row
CREATE (c: Thing {
    id: row.id,
    endDate: date(row.endDate)
})

The problem is that not every row in the endDate column of my CSV has a date. Some rows have null values or are simply empty. As a result I get an error upon encountering these non-date values:

Query failed: Invalid string for date. 
String representing the date should be in one of the following formats:
- YYYY-MM-DD
- YYYYMMDD
- YYYY-MM

I understand the error; the date function expects a specific format. I'm at a bit of a loss regarding how to best modify my query. I still want to parse and use the valid date strings, but I also need a way to handle, or perhaps skip, the problematic null or empty values. I can't skip those records so I can't prefilter and leave the rows from my CSV file out.


Solution

  • I think that this Cypher query should do the trick:

    LOAD CSV FROM "/imports/mydata.csv" WITH HEADER AS row
    CREATE (c: Thing {
        id: row.id,
        startDate: CASE row.startDate =~ "^\\s*\\d{4}(\\d{4}|-\\d{2}(-\\d{2})?)\\s*$"
            WHEN true
            THEN date(trim(row.startDate))
            ELSE null
            END
    })