Running Spark jobs on EMR with output written directly to S3, I've noticed that each S3 directory path (e.g. /the/s3/path) contains a flag file called /the/s3/path_$folder$. This is causing issues when reloading the data with Spark (it's Parquet, and Spark complains about the extra files, etc.).
How can I stop AWS (or whatever is responsible) from creating these flag files? It used to happen with Hadoop jobs too, so I don't think it's Spark itself (although Spark does use the Hadoop FS layer).
Hmm, yes, I used to get these folders as well, but they no longer appear... I suspect it is because I have made these changes to the hadoopConfiguration:
sc.hadoopConfiguration.set("spark.sql.parquet.output.committer.class","org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
Aside from committing output directly to S3, these settings prevent creation of the metadata files, which apparently are of no real use anyway and simply take a lot of time to create.
I haven't verified that these settings will make the difference, but I strongly suspect that they do. I can check it one of these days, unless you beat me to it ;)
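For context, here is a minimal sketch of how these settings might fit into a full Spark 1.x job that writes Parquet to S3. The app name, input/output paths and the DataFrame are placeholders, not anything from my actual job:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Minimal Spark 1.x sketch; names and S3 paths are placeholders.
val conf = new SparkConf().setAppName("parquet-to-s3-example")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Commit Parquet output directly to S3 instead of renaming from a temporary location.
sc.hadoopConfiguration.set(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
// Skip the _SUCCESS marker file.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
// Skip the _metadata / _common_metadata Parquet summary files.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

// Hypothetical write; substitute your own DataFrame and bucket.
val df = sqlContext.read.json("s3://my-bucket/input/")
df.write.parquet("s3://my-bucket/output/")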
EDIT:
The DirectParquetOutputCommitter is no longer available in Spark 2.x. The way to avoid the temporary writes to S3 in Spark 2.x is to add this setting to your Spark conf:
spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
(Note that it is no longer set on the hadoopConfiguration.) This will, however, not get rid of the _$folder$ folders. I have yet to work out how to disable them in Spark 2.x...
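For what it's worth, here is a minimal sketch of how that setting might look in a SparkSession-based Spark 2.x job (again, the app name and S3 paths are placeholders):
import org.apache.spark.sql.SparkSession

// Minimal Spark 2.x sketch; names and S3 paths are placeholders.
val spark = SparkSession.builder()
  .appName("parquet-to-s3-example")
  .getOrCreate()

// Use the v2 commit algorithm to avoid the rename from a temporary directory.
spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

// Hypothetical write; the _$folder$ markers may still appear on S3.
val df = spark.read.json("s3://my-bucket/input/")
df.write.parquet("s3://my-bucket/output/")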