Tags: r, apache-spark, sparkr

How to use write.df to store a CSV file when using SparkR and RStudio?


I am studying SparkR. I have a CSV file:

a <- read.df(sqlContext, "./mine/a2014.csv", "csv")

I want to use write.df to store this file. However, when I use:

write.df(a, "mine/a.csv")

I get a folder called a.csv, which contains no .csv file at all.


Solution

Spark partitions your data into blocks so that it can distribute those partitions over the nodes in your cluster. When writing the data, it retains this partitioning: it creates a directory and writes each partition to a separate file. This lets Spark take better advantage of distributed file systems (each block can be written to HDFS/S3 in parallel), and it avoids collecting all the data onto a single machine, which may not be capable of handling that amount of data.

The two files with the long names are the two partitions of your data and hold the actual CSV data. You can see this by copying them, renaming the copies with a .csv extension, and double-clicking them, or with something like head longfilename.
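
For instance, you can inspect the output directory with plain base R (the part-file name in the comment is illustrative; exact names vary across Spark versions):

# List everything Spark wrote into the output directory
# ("mine/a.csv" is the path from the question).
list.files("mine/a.csv")
# e.g. "_SUCCESS", "part-r-00000-....csv" (names vary by version)

# Peek at the first few lines of one partition file
parts <- list.files("mine/a.csv", pattern = "^part-", full.names = TRUE)
readLines(parts[1], n = 5)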

You can test whether the write was successful by reading it back in: give Spark the path to the directory and it will recognize the partitioned output through the metadata and _SUCCESS marker files inside it.
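
As a quick sketch of that round trip (assuming the same sqlContext and csv source as in the question):

# Point read.df at the directory, not at an individual part file;
# Spark reassembles the partitions into one DataFrame.
a2 <- read.df(sqlContext, "mine/a.csv", "csv")

# The row count should match the original
count(a2) == count(a)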

If you do need all the data in one file, you can use repartition to reduce the number of partitions to 1 and then write it:

b <- repartition(a, 1)
write.df(b, "mine/b.csv", "csv")

This will result in just one long-named file, which is a CSV file containing all the data.
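
If you then want that output under a plain .csv name, one option is to copy the single part file with base R once the write finishes (the part- filename pattern here is an assumption; exact names vary by Spark version):

# Find the single part file Spark produced and copy it to a plain path
part <- list.files("mine/b.csv", pattern = "^part-", full.names = TRUE)
stopifnot(length(part) == 1)
file.copy(part, "mine/b_single.csv")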

(I don't use SparkR, so this is untested; in Scala/PySpark you would prefer coalesce over repartition, but I couldn't find an equivalent SparkR function.)