Scala 2.12 and Spark 2.2.1 here. I used the following code to write the contents of a DataFrame
to S3:
import org.apache.spark.sql.SaveMode

myDF.write.mode(SaveMode.Overwrite)
  .parquet("s3n://com.example.mybucket/mydata.parquet")
When I go to com.example.mybucket on S3, I actually see a directory called mydata.parquet, as well as a file called mydata.parquet_$folder$! If I go into the mydata.parquet directory, I see two files under it:

- _SUCCESS; and
- part-<big-UUID>.snappy.parquet

whereas I was expecting to see a single file called mydata.parquet living in the root of the bucket.
Is something wrong here (and if so, what?), or is this expected with the Parquet file format? If it's expected, which is the actual Parquet file that I should read from:

- the mydata.parquet directory?
- the mydata.parquet_$folder$ file?
- the mydata.parquet/part-<big-UUID>.snappy.parquet file?

Thanks!
The mydata.parquet/part-<big-UUID>.snappy.parquet file is the actual Parquet data. However, tools like Spark routinely break a data set into multiple part files, and they expect to be pointed at the directory that contains them, so you should read from the mydata.parquet directory rather than from any individual part file. The _SUCCESS file is an empty marker indicating that the write operation completed. The mydata.parquet_$folder$ file is an artifact of the s3n connector, which creates such marker objects to emulate directories, since S3 itself has no real directories.
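
To make that concrete, here is a minimal sketch of reading the data back by pointing Spark at the directory, and, if you really do need a single part file, coalescing to one partition before the write. It assumes a SparkSession is available; the mydata-single.parquet output path is just a hypothetical example, not something from your setup.

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("ParquetReadExample")
  .getOrCreate()

// Point the reader at the directory; Spark discovers every
// part-*.snappy.parquet file under it automatically.
val readBack = spark.read.parquet("s3n://com.example.mybucket/mydata.parquet")

// Optional: collapse to one partition before writing so the output
// contains a single part file. All rows pass through one task, so
// this is only sensible for small data sets. Output path is hypothetical.
readBack.coalesce(1)
  .write.mode(SaveMode.Overwrite)
  .parquet("s3n://com.example.mybucket/mydata-single.parquet")

Note that even with coalesce(1), Spark still writes a directory containing one part file plus _SUCCESS; there is no built-in option to make it write a bare file at the given path.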