python apache-spark pyspark hdfs spark-submit

How to save a file on the cluster


I'm connected to the cluster over SSH, and I submit my program to it with:

spark-submit --master yarn myProgram.py

I want to save the result in a text file and I tried using the following lines:

counts.write.json("hdfs://home/myDir/text_file.txt")
counts.write.csv("hdfs://home/myDir/text_file.csv")

However, neither of them works. The program finishes, but I cannot find the text file in myDir. Do you have any idea how I can do this?

Also, is there a way to write directly to my local machine?

EDIT: I found out that the home directory doesn't exist, so now I save the result with counts.write.json("hdfs:///user/username/text_file.txt"). However, this creates a directory named text_file.txt that contains many files with partial results. I want a single file with the final result. Any ideas how I can do this?


Solution

  • Spark saves the results in multiple files because the computation is distributed. Therefore, writing:

    counts.write.csv("hdfs://home/myDir/text_file.csv")
    

    saves the data of each partition as a separate file inside a directory named text_file.csv. If you want the data saved as a single file, call coalesce(1) first:

    counts.coalesce(1).write.csv("hdfs://home/myDir/text_file.csv")
    

    This moves all the data into a single partition, so only one data file is written. The output path is still a directory; the single part file sits inside it. However, this can be a bad idea if you have a lot of data, because a single task then has to write everything. If the data is very small, collect() is an alternative: it brings all the data to the driver machine as a list of Row objects, which can then be saved as a single file.
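As a rough illustration of the collect() route, here is a minimal sketch that writes already-collected rows to one CSV file using only the Python standard library. The helper name save_rows_as_single_csv and the output file name are hypothetical, not part of Spark. It assumes the result is small enough to fit in the driver's memory, and that the program runs in YARN's default client deploy mode, so the driver (and therefore the file) is on the machine where spark-submit was launched, not necessarily on your own laptop.

```python
import csv


def save_rows_as_single_csv(rows, path, header=None):
    """Write an iterable of row tuples to one CSV file on the local disk.

    `rows` is expected to be what DataFrame.collect() returns: a list of
    Row objects. Row is a tuple subclass, so csv.writer can write it directly.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        if header is not None:
            writer.writerow(header)  # optional column names as the first line
        writer.writerows(rows)


# Hypothetical usage inside myProgram.py, assuming `counts` is a small DataFrame:
#
#     rows = counts.collect()  # pulls every row to the driver
#     save_rows_as_single_csv(rows, "text_file.csv", header=counts.columns)
```

Because the file is written with plain open(), it ends up on the driver's local filesystem rather than in HDFS, which may also cover the "write to a local machine" part of the question as long as the driver is the machine you are logged in to.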