I have an external table in Redshift spectrum with DDL having datetime column as somewhat below:
collector_tstamp TIMESTAMP WITHOUT TIME ZONE
Objective: I am trying to parquet a certain set of data and then add the partition into Spectrum to see if values are perfectly parsed.
After creating the parquet, following is the metadata of 'collector_tstamp' from corresponding parquet file:
{"metadata": null, "field_name": "collector_tstamp", "name": "collector_tstamp", "numpy_type": "datetime64[ns]", "pandas_type": "datetime"}
Before this, I am converting the pandas dataframe column into datetime data type with the help of following:
df['collector_tstamp'] = pd.to_datetime(df['collector_tstamp'])
Issue: When I am querying the data in Redshift spectrum, I am seeing the below values so clearly the parsing is wrong but I am not sure what data type should I change to so that it parses correctly:
collector_tstamp
36332-04-23 15:29:01
36332-04-23 15:29:01
36332-04-23 15:29:01
36332-04-23 15:29:01
36332-04-23 15:29:01
I also tried something like this but still I am getting the above value:
df['collector_tstamp'] = pd.to_datetime(df['collector_tstamp'], infer_datetime_format=True)
I also tried this but still the same value when I query it:
df['collector_tstamp'] = df['collector_tstamp'].astype('datetime64[s]')
Sample data:
collector_tstamp
2019-01-04 04:02:36
2019-01-04 04:03:41
2019-01-04 04:03:45
2019-01-04 04:04:11
Nanosecond timestamps as written by pyarrow
's default is quite new and probably not understood correctly by the current Redshift version. Have a look at the documentation at https://arrow.apache.org/docs/python/parquet.html and either try writing the files using flavor='spark'
or read about tge other settings in the section "Storing timestamps".
As you probably cannot pass all parameters correctly through pandas.DataFrame.to_parquet
, you need to use the following code to write a parquet file:
import pyarrow as pa
import pyarrow.parquet as pq
df = ..some DataFrame..
table = pa.Table.from_pandas(df)
pq.write_table(table, 'filename.parquet')