Tags: apache-spark, amazon-s3, pyspark, parquet

Spark 2.3.3 outputting parquet to S3


A while back I had the problem that writing parquet files directly to S3 isn't really feasible, and I needed a caching layer before finally copying the parquets to S3 (see this post).

I know that HADOOP-13786 should fix this problem, and it appears to be implemented in Hadoop >= 3.1.0.

Now the question is: how do I use it in Spark 2.3.3? As far as I understand, Spark 2.3.3 comes with Hadoop 2.8.5. I usually use flintrock to orchestrate my cluster on AWS. Is it just a matter of setting the HDFS version to 3.1.1 in the flintrock config, after which I get all the goodies? Or do I still have to set something in code, as I did before? For example:

from pyspark import SparkConf

conf = SparkConf().setAppName(appname)\
    .setMaster(master)\
    .set('spark.executor.memory', '13g')\
    .set('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version', '2')\
    .set('spark.hadoop.fs.s3a.fast.upload', 'true')\
    .set('spark.hadoop.fs.s3a.fast.upload.buffer', 'disk')\
    .set('spark.hadoop.fs.s3a.buffer.dir', '/tmp/s3a')

(I know this is the old code and is probably no longer relevant.)


Solution

  • You'll need Hadoop 3.1 and a build of Spark 2.4 which has this PR applied: https://github.com/apache/spark/pull/24970

    Some downstream products with their own Spark builds do this (HDP-3.1), but it's not (yet) in the apache builds.

    With that in place, you then need to configure Parquet to use the new bridging committer (Parquet only allows subclasses of the Parquet committer), and select which of the three S3A committers to use (long story). The Staging committer is the one I'd recommend, as it's (a) based on the one Netflix uses and (b) the one I've tested the most.

    There's no fundamental reason why the same PR can't be applied to Spark 2.3, just that nobody has tried.
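As a concrete sketch of the wiring described above: the settings below assume Hadoop 3.1+ and a Spark build that includes the cloud-integration module (as in HDP-3.1). The class names follow the Hadoop S3A committer documentation, but verify them against your particular build before relying on this.

```python
from pyspark import SparkConf

# Sketch only; assumes the spark-hadoop-cloud classes are on the classpath.
conf = (SparkConf()
        # Route Spark's file commit protocol through the bridging committer
        .set('spark.sql.sources.commitProtocolClass',
             'org.apache.spark.internal.io.cloud.PathOutputCommitProtocol')
        # Parquet only accepts subclasses of its own committer, hence this shim
        .set('spark.sql.parquet.output.committer.class',
             'org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter')
        # Select the Staging committer ('directory' variant); the other
        # options are 'partitioned' and 'magic'
        .set('spark.hadoop.fs.s3a.committer.name', 'directory')
        # Local directory the staging committer uses to buffer output
        .set('spark.hadoop.fs.s3a.buffer.dir', '/tmp/s3a'))
```

The `spark.hadoop.` prefix is what makes Spark pass the `fs.s3a.*` options through to the underlying Hadoop configuration.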