java, amazon-web-services, apache-spark, amazon-emr

SparkContext Java Deploy Job and MapReduce from AWS EMR


Hi, I was searching the web and the Amazon documentation for general know-how on running a Spark job on an existing EMR YARN cluster on AWS.

I'm stuck on the following. I have already set up a local[*] Spark cluster to test; now I want to test it on AWS EMR.

So far I have created an EMR cluster on AWS and cannot find documentation on running the following code. It works locally if

"spark.master.url" is set to local[*].

Class code:

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkLocalImpl implements DataMapReduce {

    private static SparkConf conf;
    private JavaSparkContext sparkContext;

    private void createContext() {
        // env is my injected configuration source; rest of the config is default
        conf = new SparkConf().setMaster(env.getProperty("spark.master.url"));
        sparkContext = new JavaSparkContext(conf);
    }

    public List<String> getMapReducedData(List<String> str) {
        createContext();
        JavaRDD<String> rdd = sparkContext.parallelize(str);
        return rdd.map(eachStr -> customMapFunction(eachStr))
                .collect()          // List<List<String>>
                .stream()
                .flatMap(x -> x.stream())
                .collect(Collectors.toList());
    }

    public List<String> customMapFunction(String str) {
        List<String> strMappedList = new ArrayList<>();
        // do something
        return strMappedList;
    }
}

Can someone tell me what I am doing wrong?


Solution

  • AWS EMR doesn't support Spark's standalone cluster mode. It supports the YARN cluster and client deploy modes (see the submission sketches below).

    However, also consider AWS Glue. Looking at your code, it looks like a simple ETL job, and AWS Glue does support a GlueContext, which is a custom implementation of a SparkContext.

    You can find that mentioned here:

    https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster

    Also check out Apache Livy on EMR (a REST submission sketch follows below).
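
    On EMR the cluster manager is YARN, so the master should not be hard-coded to local[*]. As a minimal sketch (the app name, input data, and map function below are placeholders, not your code), the driver can leave the master unset and let spark-submit supply it:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class EmrSparkJobSketch {

    public static void main(String[] args) {
        // No setMaster() here: on EMR, spark-submit (client or cluster
        // deploy mode) injects "yarn" as the master at launch time.
        SparkConf conf = new SparkConf().setAppName("emr-map-reduce-sketch");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        List<String> input = Arrays.asList("a", "b", "c");
        JavaRDD<String> rdd = sparkContext.parallelize(input);
        List<String> mapped = rdd.map(String::toUpperCase).collect();
        mapped.forEach(System.out::println);

        sparkContext.stop();
    }
}

    The packaged jar would then be launched on the master node with something like spark-submit --master yarn --deploy-mode cluster --class EmrSparkJobSketch my-job.jar, or added as an EMR step.

    If Livy is enabled on the cluster, a batch can also be submitted over its REST API. The sketch below assumes Livy's default port 8998; the host name, S3 path, and class name are placeholders:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class LivyBatchSubmitSketch {

    public static void main(String[] args) throws Exception {
        // POST /batches asks Livy to run a packaged Spark application.
        URL url = new URL("http://emr-master-node:8998/batches");
        String payload = "{"
                + "\"file\": \"s3://my-bucket/jars/my-spark-job.jar\","
                + "\"className\": \"com.example.SparkLocalImpl\""
                + "}";

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload.getBytes(StandardCharsets.UTF_8));
        }

        // 201 Created means Livy accepted the batch; its state can then be
        // polled at /batches/{id}.
        System.out.println("Livy responded with HTTP " + conn.getResponseCode());
    }
}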