I am studying Sparkr. I have a csv file:
a <- read.df(sqlContext,"./mine/a2014.csv","csv")
I want to use write.df to store this file. However, when I use:
write.df(a,"mine/a.csv")
I get a folder called a.csv, in which there is no csv file at all.
Spark partitions your data into blocks so it can distribute those partitions over the nodes in your cluster. When writing the data, it retains this partitioning: it creates a directory and writes each partition to a separate file. This way it can take better advantage of distributed file systems (writing each block in parallel to HDFS/S3), and it doesn't have to collect all the data to a single machine, which may not be able to handle that amount of data.
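As a quick sanity check you can list that output directory from R (the exact part-file names vary per run, so the ones in the comment are only illustrative):
list.files("mine/a.csv")
# e.g. "_SUCCESS", "part-r-00000-....csv", "part-r-00001-....csv"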
The two files with the long names are the two partitions of your data and hold the actual CSV data. You can see this by copying them, renaming the copies with a .csv extension and double-clicking them, or with something like head longfilename.
You can test whether the write was successful by trying to read it back in: give Spark the path to the directory and it will recognize it as a partitioned file, through the metadata and _SUCCESS files you mentioned.
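For example, mirroring the read.df call from your question (untested, same caveat as at the end of this answer):
a2 <- read.df(sqlContext, "mine/a.csv", "csv")
head(a2)   # should show the same rows as the original data frame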
If you do need all the data in one file, you can get that by using repartition to reduce the number of partitions to 1 and then writing it:
b <- repartition(a, 1)
write.df(b,"mine/b.csv")
This will result in just one long-named file which is a CSV file with all the data.
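If you want that file under a predictable name, you can copy the single part file out of the directory with base R (the pattern and the target name "mine/b_single.csv" are just examples; check the actual file name with list.files first):
parts <- list.files("mine/b.csv", pattern = "^part-", full.names = TRUE)
file.copy(parts[1], "mine/b_single.csv")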
(I don't use SparkR, so this is untested; in Scala/PySpark you would prefer to use coalesce rather than repartition, but I couldn't find an equivalent SparkR function.)