Tags: apache-spark, java-8, mapreduce, emr, amazon-emr

Running Spark app on EMR is slow


I am new to Spark and MapReduce, and I have a problem running a Spark app on an AWS Elastic MapReduce (EMR) cluster: it takes a very long time to run.

For example, I have a .csv file with a few million records that I read and converted into a JavaRDD. It took Spark 104.99 seconds to run simple mapToDouble() and sum() operations on this dataset.

When I did the same calculation without Spark, converting the .csv file to a List and using Java 8 streams, it took only 0.5 seconds (see the code below).

This is the Spark code (104.99 seconds):

    private double getTotalUnits(JavaRDD<DataObject> dataCollection)
    {
        if (dataCollection.count() > 0)
        {
            return dataCollection
                    .mapToDouble(data -> data.getQuantity())
                    .sum();
        }
        else
        {
            return 0.0;
        }
    }
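As an aside on the snippet above: the `count()` pre-check forces a full pass over the RDD (and, unless the RDD is cached, a full re-read of the input) before the sum is even computed. Since `sum()` on an empty JavaDoubleRDD already returns 0.0, the guard can be dropped. A hedged sketch, keeping the question's method and type names:

```java
// Sketch assuming Spark's Java API: sum() over an empty JavaDoubleRDD is 0.0,
// so the count() pre-check (a separate full job over the data) is unnecessary.
private double getTotalUnits(JavaRDD<DataObject> dataCollection)
{
    return dataCollection
            .mapToDouble(DataObject::getQuantity)
            .sum();
}
```

This halves the number of passes over the data, though it does not change the fixed scheduling overhead discussed in the answer below.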

And this is the same calculation in plain Java, without Spark (0.5 seconds):

    private double getTotalOps(List<DataObject> dataCollection)
    {
        if (dataCollection.size() > 0)
        {
            return dataCollection
                    .stream()
                    .mapToDouble(data -> data.getPrice() * data.getQuantity())
                    .sum();
        }
        else
        {
            return 0.0;
        }
    }

I'm new to EMR and Spark, so I don't know what I should do to fix this problem.

UPDATE: This is a single example from the app. My whole task is to calculate different statistics (sum, mean, median) and perform different transformations on 6 GB of data, which is why I decided to use Spark. The whole app takes about 3 minutes to process the 6 GB of data using regular Java, and 18 minutes using Spark and MapReduce.
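Several of the statistics mentioned in the update (sum, mean, min, max) can be gathered in a single pass in plain Java with `java.util.DoubleSummaryStatistics`; median still requires sorting or a selection algorithm. A minimal sketch, using a plain `List<Double>` of quantities rather than the question's `DataObject` (an assumption for illustration):

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;

public class QuantityStats {
    // One pass over the quantities yields count, sum, min, max and mean together.
    static DoubleSummaryStatistics summarize(List<Double> quantities) {
        return quantities.stream()
                .mapToDouble(Double::doubleValue)
                .summaryStatistics();
    }

    public static void main(String[] args) {
        DoubleSummaryStatistics s = summarize(List.of(2.0, 4.0, 6.0));
        System.out.println(s.getSum() + " / " + s.getAverage()); // 12.0 / 4.0
    }
}
```

Computing all the aggregates in one traversal instead of one job per statistic matters in both the plain-Java and the Spark version of the app.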


Solution

  • I believe you are comparing apples to oranges. You need to understand when to use a big data framework versus a normal Java program.

    Big data frameworks are not meant for small amounts of data. A framework like Hadoop or Spark must perform many management tasks in a distributed environment, which is significant overhead. For a small dataset, the actual processing time can be tiny relative to the time spent managing the whole process on the platform, so a standalone program is bound to outperform big data tools such as MapReduce or Spark.

    If you want to see the difference, process at least 1 TB of data through the two programs above and compare the time each takes.

    Beyond raw speed, big data frameworks bring fault tolerance. Think about what happens if the JVM crashes (say, with an OutOfMemoryError) during a normal Java program: the whole process simply collapses. On a big data platform, the framework ensures that processing is not halted; failure recovery and retries take place, so you do not lose the work already done on other parts of the data because of a single crash.

    The table below roughly explains when you should switch to big data.

    [table image not preserved: when to switch to Big Data]
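When comparing the two approaches, it also helps to measure wall-clock time around only the processing step, excluding JVM startup and file reading, so the numbers are comparable. A minimal, framework-free timing sketch (`timeMillis` is a hypothetical helper, not part of either program above):

```java
public class WallClock {
    // Runs a task and returns its wall-clock duration in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long ms = timeMillis(() -> {
            double sum = 0.0;
            for (int i = 0; i < 10_000_000; i++) sum += i;
        });
        System.out.println("processing took " + ms + " ms");
    }
}
```

Wrapping only the aggregation step this way in both programs would make the 3-minute vs 18-minute comparison from the update more meaningful.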