Tags: python, aws-lambda, parquet, amazon-athena, pyarrow

Are Parquet files created with pyarrow vs pyspark compatible?


I have to convert analytics data from JSON to Parquet in two steps. For the large amount of existing data I am writing a PySpark job and doing

(df.repartition(*partitionby)
    .write.partitionBy(partitionby)
    .mode("append")
    .parquet(output, compression=codec))

However, for incremental data I plan to use AWS Lambda. PySpark would probably be overkill for that, so I plan to use PyArrow instead (I am aware that it unnecessarily involves Pandas, but I couldn't find a better alternative). So, basically:

import pyarrow.parquet as pq

# 'table' is a pyarrow.Table holding the incremental records (see the sketch below)
pq.write_table(table, outputPath, compression='snappy',
    use_deprecated_int96_timestamps=True)
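
For context, here is a minimal sketch of how the incremental JSON might be turned into that table via Pandas; the file name and record layout are hypothetical:

import json
import pandas as pd
import pyarrow as pa

# Hypothetical incremental payload: one JSON object per line (JSON Lines).
with open("incremental_events.json") as f:
    records = [json.loads(line) for line in f]

# Pandas is only an intermediate step; the resulting Arrow table is what
# gets passed to pq.write_table() above.
df = pd.DataFrame.from_records(records)
table = pa.Table.from_pandas(df)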

I wanted to know: will the Parquet files written by PySpark and by PyArrow be compatible with each other (and, in particular, readable by Athena)?


Solution

  • Parquet files written by pyarrow (the Python library for Apache Arrow) are compatible with Apache Spark, but you have to be careful about which data types you write into the Parquet files, since Apache Arrow supports a wider range of them than Apache Spark does.
  • There is currently a flag flavor=spark in pyarrow that you can use to automatically set some compatibility options so that Spark can read these files back in. Sadly, in the latest release this option is not sufficient (this is expected to change with pyarrow==0.9.0).
  • You should take care to write out timestamps using the deprecated INT96 type (use_deprecated_int96_timestamps=True) and to avoid unsigned integer columns. For unsigned integer columns, simply cast them to signed integers. Sadly, Spark errors out if your schema contains an unsigned type instead of just loading the values as signed (they are actually always stored as signed; the unsigned variant is only marked with a flag).
  • If you respect these two points, the files should be readable by Apache Spark and AWS Athena (which is just Presto under the hood).
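
As an illustration, here is a minimal sketch of applying those two precautions with pyarrow; the helper function, column names and output path are made up for the example, and flavor='spark' is combined with the INT96 timestamp flag:

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.types as pat

def cast_unsigned_to_signed(table):
    # Replace unsigned integer columns with int64 so Spark/Athena accept the schema.
    fields = [pa.field(f.name, pa.int64()) if pat.is_unsigned_integer(f.type) else f
              for f in table.schema]
    return table.cast(pa.schema(fields))

# Hypothetical table with one unsigned column, purely for illustration.
table = pa.table({"user_id": pa.array([1, 2, 3], type=pa.uint32()),
                  "clicks": [10, 20, 30]})

pq.write_table(cast_unsigned_to_signed(table), "events.parquet",
               compression='snappy',
               flavor='spark',                        # apply Spark compatibility options
               use_deprecated_int96_timestamps=True)  # store timestamps as INT96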