This is the same question as here, but the accepted answer does not work for me.
Attempt: I try to save a Dask dataframe in Parquet format and read it with Spark.
Issue: the timestamp column cannot be interpreted by pyspark.
What I have done:
I try to save a Dask dataframe to HDFS as Parquet using
import dask.dataframe as dd
dd.to_parquet(ddf_param_logs, 'hdfs:///user/<myuser>/<filename>', engine='pyarrow', flavor='spark')
Then I read the file with pyspark:
sdf = spark.read.parquet('hdfs:///user/<myuser>/<filename>')
sdf.show()
>>> org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://nameservice1/user/<user>/<filename>/part.0.parquet. Column: [utc_timestamp], Expected: bigint, Found: INT96
But if I save the dataframe with
dd.to_parquet(ddf_param_logs, 'hdfs:///user/<myuser>/<filename>', engine='pyarrow', use_deprecated_int96_timestamps=True)
the utc_timestamp column contains the timestamp information as a Unix epoch integer in microseconds (1578642290403000).
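That integer is just microseconds since the Unix epoch, so it can be decoded on either side. A sketch in pandas using the value from the question (the pyspark equivalent would be dividing the column by 1e6 and casting it to `timestamp`):

```python
import pandas as pd

# The raw value shown above, interpreted as microseconds since the epoch
raw = 1578642290403000
ts = pd.to_datetime(raw, unit="us", utc=True)
print(ts)  # 2020-01-10 07:44:50.403000+00:00
```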
This is my environment:
dask==2.9.0
dask-core==2.9.0
pandas==0.23.4
pyarrow==0.15.1
pyspark==2.4.3
The INT96 type was explicitly included in order to allow compatibility with Spark, which chose not to use the standard time type defined by the Parquet spec. Unfortunately, it seems that they have changed again, and no longer use their own previous convention, nor the Parquet one.
If you could find out what type Spark wants here, and post an issue to the Dask repo, it would be appreciated. You would want to output data from Spark containing time columns, and see what format it ends up as.
Did you also try the fastparquet backend?