Tags: scala, apache-spark-sql, amazon-emr

How can I configure Spark so that it creates "_$folder$" entries in S3?


When I write my DataFrame to S3 using

df.write
  .format("parquet")
  .mode("overwrite")
  .partitionBy("year", "month", "day", "hour", "gen", "client")
  .option("compression", "gzip")
  .save("s3://xxxx/yyyy")

I get the following in S3:

year=2018
year=2019

but I would like to have this instead:

year=2018
year=2018_$folder$
year=2019
year=2019_$folder$

The scripts reading from that S3 location depend on the *_$folder$ entries, but I haven't found a way to configure Spark/Hadoop to generate them.

Any idea which Hadoop or Spark configuration setting controls the generation of *_$folder$ files?


Solution

  • Those markers are a legacy feature; I don't think anything creates them any more. They are generally ignored when actually listing directories (that is, even if present, they get stripped from listings and replaced with directory entries).
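
    If the downstream scripts can't be changed, one workaround is to create the markers yourself after the write, using the Hadoop FileSystem API that Spark already ships with. A minimal sketch, assuming an existing SparkSession named spark and the same s3://xxxx/yyyy base path from the question; it only covers the first partition level (year=...), so deeper levels would need recursion:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val basePath = new Path("s3://xxxx/yyyy")
    val fs = FileSystem.get(basePath.toUri, spark.sparkContext.hadoopConfiguration)

    // Drop a zero-byte "<name>_$folder$" object next to each partition directory,
    // mimicking what the old s3n connector used to emit.
    fs.listStatus(basePath)
      .filter(_.isDirectory)
      .foreach { st =>
        val marker = new Path(basePath, st.getPath.getName + "_$folder$")
        fs.create(marker, true).close() // creates an empty marker object
      }

    Since the marker is just an ordinary empty S3 object, any S3 client could create it equally well; doing it through the Hadoop FileSystem API simply reuses the credentials and configuration Spark is already running with.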