Tags: java, apache-spark, hadoop, rdd

Spark saveAsTextFile is taking a lot of time - 1.6.3


I extract data from MongoDB, process it, and then store it in HDFS.

Extraction and processing of 1M records completes in less than 1.1 minutes.

Extraction Code

JavaRDD<Document> rdd = MongoSpark.load(jsc);
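
For context, a minimal sketch of how jsc might be wired up for the Mongo Spark connector; the URI, host, database, and collection below are placeholders, not values from the question:

    import com.mongodb.spark.MongoSpark;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.bson.Document;

    // spark.mongodb.input.uri tells the connector which collection to
    // read; everything in the URI here is a placeholder.
    SparkConf conf = new SparkConf()
            .setAppName("mongo-to-hdfs")
            .set("spark.mongodb.input.uri",
                 "mongodb://host:27017/mydb.mycollection");
    JavaSparkContext jsc = new JavaSparkContext(conf);

    JavaRDD<Document> rdd = MongoSpark.load(jsc);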

Processing Code

    JavaRDD<String> fullFile = rdd.map(new Function<Document, String>() {
        public String call(Document s) {
            // System.out.println(" About to Transform Json ----- " + s.toJson());
            return JsonParsing.returnKeyJson(
                    JsonParsing.returnFlattenMapJson(s.toJson()),
                    args[3].split(","), extractionDetails);
        }
    });
    System.out.println("Records Downloaded - " + fullFile.count());

This completes in less than 1.1 minutes, as I fetch the count of the RDD at that point (count() is an action, so it forces the extraction and the map above to execute).

After that I have the save command, which is as follows:

    fullFile
        .coalesce(1)
        .saveAsTextFile(args[4], GzipCodec.class);

This takes at least 15 to 20 minutes to save into HDFS.

I am not sure why it takes so much time. Let me know if anything can be done to speed up the process.

I am running it with the following options: --num-executors 4 --executor-memory 4g --executor-cores 4

Increasing the number of executors or the memory does not make any difference. I have set the number of partitions to 70; I am not sure whether increasing it would improve performance.
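
One quick check before tuning further: print how many partitions actually reach the write stage (a small sketch using the fullFile RDD from above):

    // Number of partitions before the write; coalesce(1) collapses all
    // of them into a single write task regardless of this value.
    System.out.println("Partitions before save: " + fullFile.partitions().size());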

Any suggestion to reduce the save time would be helpful.

Thanks in advance.


Solution

    fullFile
        .coalesce(1)
        .saveAsTextFile(args[4], GzipCodec.class);

Here you're using coalesce(1), which reduces the number of partitions to 1; that is why it is taking more time. Since there is only one partition at write time, only one task/executor writes the whole dataset to the target location. If you want to write faster, increase the value passed to coalesce, or simply remove coalesce altogether. You can see the number of partitions used while writing data in the Spark UI.
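
A minimal sketch of both options (the coalesce value of 16 is purely illustrative, not a recommendation; tune it to your cluster):

    import org.apache.hadoop.io.compress.GzipCodec;

    // Option 1: remove coalesce entirely; each existing partition
    // writes its own part file in parallel.
    fullFile.saveAsTextFile(args[4], GzipCodec.class);

    // Option 2: still limit the number of output files, but keep more
    // than one partition so several tasks share the write.
    fullFile
        .coalesce(16)
        .saveAsTextFile(args[4], GzipCodec.class);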