I am trying to calculate the average of numbers given in a txt file on S3, using Spark on AWS EMR.
But I am not sure what I should use: MLlib? Spark SQL? All the references I am finding are for totally different things. Could anyone point me in the right direction?
SparkConf sparkConf = new SparkConf().setAppName("com.company.app.JavaSparkPi");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);

//READING S3 FILE
//PARSING THE FILE CREATING ARRAY OF NUMBERS
int slices = 2;
int n = 10 * slices;
List<Integer> l = new ArrayList<Integer>(n);
for (int i = 0; i < n; i++) {
    l.add(i);
}

//NOT SURE WHAT TO DO HERE
//SHOULD I USE PARALLELIZE ??
JavaRDD<Integer> dataSet = jsc.parallelize(l, slices);
int count = dataSet.map(new Function<Integer, Integer>() {
    @Override
    public Integer call(Integer integer) {
        //JUST MAP THE INTEGER TO INT?
        //OR SOME LOGIC NEEDS TO BE PLACED
        double x = Math.random() * 2 - 1;
        double y = Math.random() * 2 - 1;
        return (x * x + y * y < 1) ? 1 : 0;
    }
}).reduce(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer integer, Integer integer2) {
        //SOME LOGIC HERE?
        return integer + integer2;
    }
});

//WRITE S3
System.out.println("Pi is roughly " + 4.0 * count / n);
jsc.stop();
You probably want to use the Spark SQL / DataFrame functionality. These APIs provide SQL-like transformations that will give you better performance than the lower-level RDD APIs. MLlib is the machine learning component of Spark; you don't need it for ETL operations, only if you are training a new ML model.
You should start with some reading. I would begin with the general Spark documentation; it will give you an idea of how to get data into your Spark job and interact with it.
Then I would read up on EMR, specifically how to create a cluster and how to access the Spark shell:
Accessing Spark shell on EMR cluster
Once you are in the Spark shell, you will be able to load data from S3 the same way you can from HDFS. For simple text files, for example, you could just do (assuming pyspark):
df = spark.read.text("s3://some-bucket/path/to/files/")