Search code examples
timestamppyarrowapache-arrow

pyarrow: Parse non-standard date


I have a CSV file with columns using nonstandard date format that requires a little cleaning. How can I parse this into pyarrow and write to parquet? The format is either:

  • %Y%m%d ex: 19991231
  • %Y%m00 ex: 19991200 (Year and month, but no day specified)
  • %Y0000 ex: 19990000 (Only year specified)
  • 0 or blank for unknown.

I want to be able to parse these all into a date (or timestamp) type. The last option (0 or blank) should always become a null value. I'm happy to treat %Y%m00 as being the first day of that month and %Y0000 as first day of that year (or some similar default assumption).

I tried using pyarrow.csv.read_csv() with convert_options=pyarrow.csv.ConvertOptions(timestamp_parsers=["%Y%m%d", "%Y%m00", "%Y0000"]), but this led to pyarrow.lib.ArrowInvalid: In CSV column #16: CSV conversion error to timestamp[s]: invalid value '0'.

I'm happy to do this conversion either at loading time or afterwards.


Solution

  • I'm not sure you'll be able to get pyarrow.csv.read_csv to support this.

    However you can load the date column as strings and convert it later using pyarrow.compute

    import pyarrow.csv
    import pyarrow as pa
    import pyarrow as pc
    
    table = pyarrow.csv.read_csv 
    
    # Get the date column
    array = table['my_date_column'].combine_chunks()
    # Replace string ending with 0000
    array = pc.replace_substring_regex(array, "0000$", "0101")
    # Replace string ending with 00
    array = pc.replace_substring_regex(array, "00$", "01")
    # Replace string equal to "0" with null
    array = pc.replace_with_mask(array, pc.equal(array, "0"), pa.scalar(None, pa.string()))
    # Convert to timestamp
    array = pc.strptime(array, format="%Y%m%d", unit='s')
    # Convert to date
    array = array.cast(pa.date32())
    # Put back in the table
    table = table.set_column(table.schema.get_field_index("my_date_column"), "my_date_column", array)