Tags: pandas, pyarrow, amazon-redshift-spectrum

Redshift Spectrum incorrectly parsing PyArrow datetime64[ns]


I have an external table in Redshift Spectrum whose DDL defines the datetime column roughly as below:

collector_tstamp TIMESTAMP WITHOUT TIME ZONE

Objective: I am trying to write a certain set of data to Parquet and then add the partition to Spectrum to see whether the values are parsed correctly.

After creating the Parquet file, this is the metadata of 'collector_tstamp' from the corresponding file:

{"metadata": null, "field_name": "collector_tstamp", "name": "collector_tstamp", "numpy_type": "datetime64[ns]", "pandas_type": "datetime"}

Before this, I convert the pandas DataFrame column to a datetime dtype with the following:

df['collector_tstamp'] = pd.to_datetime(df['collector_tstamp'])

Issue: When I query the data in Redshift Spectrum, I see the values below, so the parsing is clearly wrong, but I am not sure which data type I should convert to so that it parses correctly:

collector_tstamp
36332-04-23 15:29:01
36332-04-23 15:29:01
36332-04-23 15:29:01
36332-04-23 15:29:01
36332-04-23 15:29:01

I also tried something like this, but I still get the values above:

df['collector_tstamp'] = pd.to_datetime(df['collector_tstamp'], infer_datetime_format=True)

I also tried this, but I still see the same values when I query:

df['collector_tstamp'] = df['collector_tstamp'].astype('datetime64[s]')

Sample data:

collector_tstamp
2019-01-04 04:02:36
2019-01-04 04:03:41
2019-01-04 04:03:45
2019-01-04 04:04:11

Solution

  • Nanosecond timestamps, which pyarrow writes by default, are quite new and probably not understood correctly by the current Redshift version. Have a look at the documentation at https://arrow.apache.org/docs/python/parquet.html and either try writing the files using flavor='spark' or read about the other settings in the section "Storing timestamps".

    As you probably cannot pass all of these parameters through pandas.DataFrame.to_parquet, use code like the following to write the Parquet file:

    import pyarrow as pa
    import pyarrow.parquet as pq

    df = ...  # your pandas DataFrame

    # Convert the pandas DataFrame to an Arrow table.
    table = pa.Table.from_pandas(df)

    # flavor='spark' enables Spark-compatible options, including the
    # legacy int96 timestamp representation instead of int64 nanoseconds.
    pq.write_table(table, 'filename.parquet', flavor='spark')
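
    Alternatively, per the "Storing timestamps" section of the pyarrow docs, you can keep int64 timestamps but coerce them to millisecond precision, which older readers handle; a minimal sketch reusing the table and placeholder filename from above:

    pq.write_table(table, 'filename.parquet',
                   coerce_timestamps='ms',
                   allow_truncated_timestamps=True)

    allow_truncated_timestamps=True suppresses the error pyarrow would otherwise raise when sub-millisecond precision is dropped during the coercion.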