
Parquet file format on S3: which is the actual Parquet file?


Scala 2.12 and Spark 2.2.1 here. I used the following code to write the contents of a DataFrame to S3:

myDF.write.mode(SaveMode.Overwrite)
  .parquet("s3n://com.example.mybucket/mydata.parquet")

When I go to com.example.mybucket on S3, I actually see a directory called "mydata.parquet", as well as a file called "mydata.parquet_$folder$"! If I go into the mydata.parquet directory I see two files under it:

  • _SUCCESS; and
  • part-<big-UUID>.snappy.parquet

I was expecting to see a single file called mydata.parquet living in the root of the bucket.

Is something wrong here (if so, what?), or is this expected with the Parquet file format? If it's expected, which is the actual Parquet file that I should read from:

  1. mydata.parquet directory?; or
  2. mydata.parquet_$folder$ file?; or
  3. mydata.parquet/part-<big-UUID>.snappy.parquet?

Thanks!


Solution

  • mydata.parquet/part-<big-UUID>.snappy.parquet is the actual Parquet data file, but you should point your reader at the mydata.parquet directory (option 1). Tools like Spark routinely split a data set into multiple part files and expect to be handed the directory that contains them all. The _SUCCESS file is an empty flag indicating that the write operation completed, and the mydata.parquet_$folder$ object is a marker the s3n connector creates to emulate a directory on S3; neither contains data.
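Concretely, reading the data back means pointing the reader at the directory path, not at an individual part file. A minimal sketch, assuming an existing SparkSession named `spark` (the bucket path is the one from the question):

```scala
// Read the whole dataset back: Spark discovers every
// part-*.snappy.parquet file under the directory on its own
// and ignores the _SUCCESS marker.
val myDF = spark.read
  .parquet("s3n://com.example.mybucket/mydata.parquet")

// Inspect what came back.
myDF.printSchema()
```

Reading a single part file directly would also work for this one-file case, but it breaks as soon as a later write produces multiple parts, so the directory path is the stable choice.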