Search code examples
duckdb

Gracefully handling errors when parsing dates in DuckDB


While ingesting some very poor quality data I'd like to use strptime and nullify any entries that error on strptime.

I can manually specify exclusions with a case statement:

select * EXCLUDE (date_of_birth),
CASE
   WHEN date_of_birth[5:6]::int <= 12 AND 
        date_of_birth[5:6]::int > 0 AND
        date_of_birth[7:]::int <= 31 AND
        NOT (date_of_birth[5:6]::int = 11 AND date_of_birth[7:]::int >= 31) AND
        NOT (date_of_birth[5:6]::int = 2 AND date_of_birth[7:]::int >= 29)
   THEN strptime(date_of_birth, '%Y%m%d')::DATE
   ELSE NULL
END AS date_of_birth

from read_csv_auto(
    'bad_data.csv', 
    types={'date_of_birth': 'VARCHAR'}
)

But this method is very unsatisfactory. For one I can't anticipate all possible bad dates - and in this instance I'm not able to change the source system. Can anyone point me in a better direction?


Solution

  • The newer try_strptime() function introduced by @hawkfish in a comment last year is the most concise (and likely most robust) solution.

    SELECT
      try_strptime(date_of_birth, '%Y%m%d')::date    as date_of_birth
    FROM ...