My goal is to accumulate all the data from every Spark worker into a single file.
I read an article describing a solution to a similar problem, where the author suggested using org.apache.hadoop.fs.FileUtil#copyMerge
for this purpose. I decided to implement it in my project, and here is what I have:
try (JavaSparkContext sparkCtx = new JavaSparkContext(sparkConf)) {
    // read, transform and save the RDDs as text files
    FileUtil.copyMerge(...); // then merge them all into a single file
} // 'try-with-resources' eventually closes the Spark context
While implementing this approach I got confused: if I run this code, won't it eventually run on every worker instance, so that the workers overwrite each other? What happens if some worker doesn't finish its job? Will every worker end up with its own copy of the final single file?
I realized that I need to find some place/method that guarantees all workers have finished their execution, and only then start accumulating the data.
How can this be achieved? My guess is to run this data accumulation after the try-with-resources
block; is that correct?
FileUtil is completely independent of Spark and doesn't use Spark workers or executors.
If you want to make sure it is executed after the Spark application has finished, you can call it right after you stop the context:
sparkCtx.stop();
FileUtil.copyMerge(...); // runs only on the driver, after all Spark jobs have completed
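
For illustration, here is a minimal, self-contained sketch of this ordering in Java, assuming Hadoop 2.x (FileUtil.copyMerge exists there but was removed in Hadoop 3.0). The paths /tmp/output-parts and /tmp/merged.txt, the sample data and the local[*] master are placeholder assumptions, not from the question; the only point being shown is that copyMerge runs on the driver after the context is closed:

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MergePartsExample {
    public static void main(String[] args) throws IOException {
        SparkConf sparkConf = new SparkConf()
                .setAppName("copy-merge-sketch")
                .setMaster("local[*]");             // placeholder master for this sketch

        String partsDir = "/tmp/output-parts";      // directory of part-* files (must not exist yet)
        String mergedFile = "/tmp/merged.txt";      // the single merged output file

        try (JavaSparkContext sparkCtx = new JavaSparkContext(sparkConf)) {
            // The workers write one part file per partition into partsDir.
            sparkCtx.parallelize(Arrays.asList("line 1", "line 2", "line 3"), 3)
                    .saveAsTextFile(partsDir);
        } // close() stops the context, equivalent to the explicit sparkCtx.stop() above

        // From here on only the driver is running; all Spark jobs have completed.
        Configuration hadoopConf = new Configuration();
        FileSystem fs = FileSystem.get(hadoopConf);
        FileUtil.copyMerge(fs, new Path(partsDir),   // source: directory of part files
                           fs, new Path(mergedFile), // destination: single file
                           false,                    // keep the source part files
                           hadoopConf, null);        // no extra string inserted between parts
    }
}

On a real cluster both FileSystem instances would typically point at HDFS rather than the local disk, but the call order stays the same: finish the Spark application first, then merge.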