python parquet apache-drill pyarrow

Apache-Drill doesn't understand Pandas datetime64[ns]


I'm using pyarrow, pyarrow.parquet, and pandas. When I write a pandas datetime64[ns] series to a Parquet file and load it again via a Drill query, the query shows an integer such as 1467331200000000, which doesn't look like a UNIX timestamp.

The query looks like this:

SELECT workspace.id-column AS id-column, workspace.date-column AS date-column

When I open that file within Python again, it loads correctly and still has its datetime64[ns] type.

Any idea what's going wrong and how to solve this? I want this value to be shown as a regular date.
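
A quick sanity check on the integer from the question, using only the standard library. This is a sketch that assumes the value counts microseconds since the UNIX epoch (UTC); the question does not state the unit, so treat the result as a hint and not as proof of how the file was written.

```python
from datetime import datetime, timedelta, timezone

raw = 1467331200000000  # integer shown by the Drill query in the question

# Assumption: the integer counts microseconds since 1970-01-01 UTC.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
decoded = epoch + timedelta(microseconds=raw)

print(decoded.isoformat())  # 2016-07-01T00:00:00+00:00
```

Under that assumption the value decodes to a plausible calendar date, which suggests the column holds a timestamp at a finer resolution than seconds.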


Solution

  • OK, I found a solution a few days ago that I'd like to share. I had initially missed something: before writing the dataframe to Parquet, it's important to downcast the timestamps to [ms] and to allow truncated timestamps. Only then does the file open in Drill without issues:

    import pyarrow.parquet as pq

    # Store timestamps as milliseconds and allow dropping finer precision
    pq.write_table(table, f'{name}.parquet',
                   coerce_timestamps='ms',
                   allow_truncated_timestamps=True)
    

    When I define a view in Drill, I can then cast that column to a date or timestamp as required.