hadoop · amazon-web-services · amazon-s3 · apache-spark · emr

Stop hadoop/EMR/AWS creating S3 paths with _$folder$ extensions


I'm running Spark jobs on EMR with the output written directly to S3. I've noticed that each S3 directory path (e.g. /the/s3/path) gets a companion marker file called /the/s3/path_$folder$. This is causing issues when reloading the data with Spark (the output is Parquet, and Spark complains about the extra files).

How can I stop AWS/whatever it is from creating these markers? It used to happen with Hadoop jobs too, so I don't think it's Spark itself (although Spark does use the Hadoop filesystem layer).


Solution

  • Hmm, yes, I used to get these folders as well, but they no longer appear... I suspect it is because I have made these changes to the hadoopConfiguration:

    sc.hadoopConfiguration.set("spark.sql.parquet.output.committer.class","org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
    sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
    

    Aside from committing output directly to S3, these settings prevent creation of the metadata files, which apparently are of no real use anyway and simply take a lot of time to create.

    I haven't verified that these settings will make the difference, but I strongly suspect that they do. I can check it one of these days, unless you beat me to it ;)
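
    For context, here is a minimal sketch of how these settings fit into a full job, assuming a Spark 1.x build where DirectParquetOutputCommitter still exists; the bucket and path names are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("parquet-to-s3"))
    val sqlContext = new SQLContext(sc)

    // the three settings from above, applied before any write happens
    sc.hadoopConfiguration.set("spark.sql.parquet.output.committer.class", "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
    sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    val df = sqlContext.read.parquet("s3://my-bucket/input/")    // placeholder input path
    df.write.mode("overwrite").parquet("s3://my-bucket/output/") // placeholder output path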

    EDIT:

    The DirectParquetOutputCommitter is no longer available in Spark 2.x. The way to avoid the temporary writes to S3 in Spark 2.x is to add this setting to your Spark conf:

    spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
    

    (Note that it is no longer set on the hadoopConfiguration.) This will, however, not get rid of the _$folder$ markers. I have yet to work out how to disable them in Spark 2.x...
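
    For completeness, a minimal Spark 2.x sketch with that setting in place; the bucket and path names are placeholders, and as noted the _$folder$ markers will still appear:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parquet-to-s3").getOrCreate()

    // commit algorithm v2 moves task output to the final location at task commit,
    // avoiding the slow job-level rename out of _temporary on S3
    spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

    val df = spark.read.parquet("s3://my-bucket/input/")         // placeholder input path
    df.write.mode("overwrite").parquet("s3://my-bucket/output/") // placeholder output path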