I have a simple Glue ETL job that is triggered by a Glue workflow. It drops duplicate rows from a crawler table and writes the result back to an S3 bucket. The job completes successfully, however the empty "$folder$" marker objects that the job generates remain in S3. They clutter the bucket hierarchy and cause confusion. Is there any way to configure Spark or the Glue context to hide/remove these folders after successful completion of the job?
Ok, finally after a few days of testing I found the solution. Before pasting the code, let me summarize what I found: the "$folder$" objects are empty directory markers written by the default Hadoop S3 filesystem that Glue/EMR uses, while the S3A filesystem implementation does not create them.
The solution is to set the following option in the Spark context's Hadoop configuration, which routes s3:// paths through the S3A filesystem instead:
from pyspark.context import SparkContext

sc = SparkContext()
# Use the S3A filesystem for s3:// URIs; it does not write "$folder$" markers
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
To avoid creation of the _SUCCESS marker files as well, you need to set the following configuration:
hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
Make sure you use an s3:// URI when writing to the S3 bucket, so that the fs.s3.impl mapping above applies. For example:

myDF.write.mode("overwrite").parquet('s3://XXX/YY', partitionBy=['DDD'])
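For context, here is a minimal sketch of how these pieces fit together in a Glue PySpark script. The database name, table name, and partition column (my_database, my_table, DDD) are placeholders for your own values, and dropDuplicates() stands in for whatever dedup logic your job actually uses:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
# Apply the S3A / no-_SUCCESS settings before any write happens
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the crawler table, drop duplicates, and write the result back as Parquet
dyf = glueContext.create_dynamic_frame.from_catalog(database="my_database", table_name="my_table")
myDF = dyf.toDF().dropDuplicates()
myDF.write.mode("overwrite").parquet('s3://XXX/YY', partitionBy=['DDD'])

job.commit()

The key point is that both hadoopConfiguration() settings are applied right after the SparkContext is created, before the GlueContext and any reads or writes.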