I have a CSV file with columns using nonstandard date format that requires a little cleaning. How can I parse this into pyarrow and write to parquet? The format is either:
%Y%m%d
ex: 19991231
%Y%m00
ex: 19991200
(Year and month, but no day specified)%Y0000
ex: 19990000
(Only year specified)0
or blank for unknown.I want to be able to parse these all into a date (or timestamp) type. The last option (0
or blank) should always become a null value. I'm happy to treat %Y%m00
as being the first day of that month and %Y0000
as first day of that year (or some similar default assumption).
I tried using pyarrow.csv.read_csv()
with convert_options=pyarrow.csv.ConvertOptions(timestamp_parsers=["%Y%m%d", "%Y%m00", "%Y0000"])
, but this led to pyarrow.lib.ArrowInvalid: In CSV column #16: CSV conversion error to timestamp[s]: invalid value '0'
.
I'm happy to do this conversion either at loading time or afterwards.
I'm not sure you'll be able to get pyarrow.csv.read_csv
to support this.
However you can load the date column as strings and convert it later using pyarrow.compute
import pyarrow.csv
import pyarrow as pa
import pyarrow as pc
table = pyarrow.csv.read_csv
# Get the date column
array = table['my_date_column'].combine_chunks()
# Replace string ending with 0000
array = pc.replace_substring_regex(array, "0000$", "0101")
# Replace string ending with 00
array = pc.replace_substring_regex(array, "00$", "01")
# Replace string equal to "0" with null
array = pc.replace_with_mask(array, pc.equal(array, "0"), pa.scalar(None, pa.string()))
# Convert to timestamp
array = pc.strptime(array, format="%Y%m%d", unit='s')
# Convert to date
array = array.cast(pa.date32())
# Put back in the table
table = table.set_column(table.schema.get_field_index("my_date_column"), "my_date_column", array)