amazon-web-services, apache-spark, amazon-emr

AWS EMR Spark - calculate avg of numbers from file on S3


I am trying to calculate the average of numbers given in a text file on S3, using Spark on AWS EMR.

But I am not sure what I should use: MLlib? Or Spark SQL? All the references I am seeing are for totally different things. Could anyone guide me in the right direction?

    SparkConf sparkConf = new SparkConf().setAppName("com.company.app.JavaSparkPi");
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);

    //READING S3 FILE
    //PARSING THE FILE CREATING ARRAY OF NUMBERS

    int slices = 2;
    int n = 10 * slices;
    List<Integer> l = new ArrayList<Integer>(n);
    for (int i = 0; i < n; i++) {
        l.add(i);
    }

    //NOT SURE WHAT TO DO HERE 
    //SHOULD I USE PARALLELIZE ??
    JavaRDD<Integer> dataSet = jsc.parallelize(l, slices);

    int count = dataSet.map(new Function<Integer, Integer>() {
        @Override
        public Integer call(Integer integer) {
            //JUST MAP THE INTEGER TO INT?
            //OR SOME LOGIC NEEDS TO BE PLACED
            double x = Math.random() * 2 - 1;
            double y = Math.random() * 2 - 1;
            return (x * x + y * y < 1) ? 1 : 0;
        }
    }).reduce(new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer integer, Integer integer2) {
            //SOME LOGIC HERE?
            return integer + integer2;
        }
    });

    //WRITE S3
    System.out.println("Pi is roughly " + 4.0 * count / n);

    jsc.stop();

Solution

  • You probably want to use the Spark SQL/DataFrame functionality. These APIs provide SQL-like transformations that will give you better performance than the lower-level RDD APIs. MLlib is the machine learning component of Spark, which you don't need for ETL operations like this; you would only use it if you were training an ML model.

    You should start with some reading. First, I would start with the general Spark documentation. This will give you an idea of how to get data into your Spark job and how to interact with it.

    Spark Quick Start

    Then I would read up on EMR, specifically on how to create a cluster and how to access the Spark shell:

    Create EMR cluster with Spark

    Accessing Spark shell on EMR cluster

    Once you are in the Spark shell, you will be able to load data from S3 the same way you can from HDFS. For simple text files, for example, you could just do (assuming pyspark):

    df = spark.read.text("s3://some-bucket/path/to/files/")
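
    From there, computing the average is a single aggregation. Here is a minimal sketch, assuming each line of the file contains one number: spark.read.text always produces a single string column named value, and rows that do not parse as numbers become null, which avg ignores.

    from pyspark.sql.functions import avg

    # Cast the single "value" string column produced by spark.read.text to double.
    numbers = df.select(df["value"].cast("double").alias("number"))

    # Aggregate across all rows and pull back the single result row.
    average = numbers.agg(avg("number").alias("average")).collect()[0]["average"]
    print("Average is", average)

    The same flow works from a spark-submit job; you would just create the SparkSession yourself instead of relying on the shell's predefined spark variable.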