python-3.x pandas dataframe parquet

python3 - to_parquet data format


When I save a DataFrame with a datetime64[ns] column as Parquet using pandas' to_parquet and then read the Parquet file back, the datetime64[ns] column is converted to a Unix timestamp.

e.g. 2022-10-05 19:31:57.894835 -> 1664998317894835000

Is it not possible to save the datetime64[ns] column as it is?


Solution

  • The datetime64[ns] type of a pd.DataFrame column is a dtype specific to pandas or, more precisely, to NumPy.

    It does not correspond directly to the types defined by the Parquet format (source: Apache's official Parquet docs).

    You should also check which engine you are using to write the Parquet file. According to the pandas API reference for to_parquet, the engine defaults to 'auto' when not specified explicitly, which tries pyarrow first and falls back to fastparquet if pyarrow is unavailable.

    If pyarrow is your engine, the following type differences apply:

    https://arrow.apache.org/docs/python/pandas.html#type-differences

    The same Arrow documentation suggests how to handle this:

    If you want to use NumPy’s datetime64 dtype instead, pass date_as_object=False:

    In [26]: s2 = pd.Series(arr.to_pandas(date_as_object=False))

    In [27]: s2.dtype
    Out[27]: dtype('<M8[ns]')
    

    Bonus: if the file is read back in Spark, you can use Spark SQL's datetime functions afterwards to convert the Unix timestamp (see the Spark SQL datetime documentation).
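If the file is reloaded in pandas rather than Spark and the column has already arrived as integer nanoseconds, it can be converted back after the fact. This is a small sketch using the value from the question; the series name `ts` is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical column that came back as int64 nanoseconds since the Unix epoch.
raw = pd.Series([1664998317894835000], name="ts")

# unit="ns" tells pandas to interpret the integers as nanoseconds.
restored = pd.to_datetime(raw, unit="ns")

print(restored.iloc[0])  # 2022-10-05 19:31:57.894835
```

The result is timezone-naive, so it matches the original value only if that value was also naive or in UTC.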