java · apache-spark · hadoop · amazon-emr · word-count

Java+Spark wordCount with EMR


I've been trying to run the Pi estimation and the wordCount examples from https://spark.apache.org/examples.html in Java on EMR.

The Pi estimation works fine, so I assumed everything was set up properly. But I get this error with the wordCount:

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://XXX/user/hadoop/input.txt

I downloaded my input.txt and my jar from S3 before running this command:

spark-submit --class "wordCount" --master local[4] Spark05-1.1.jar input.txt

Here's my wordCount code:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public final class wordCount {

    public static void main(String[] args) {

        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("JD Word Counter");

        JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);



        JavaRDD<String> textFile = sparkContext.textFile(args[0]);
        JavaPairRDD<String, Integer> counts = textFile
                .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);
        counts.saveAsTextFile("result.txt");


    }
}

Am I doing anything wrong?


Solution

  • If you didn't load your input.txt into HDFS, put it there first and try again.

    Alternatively, pass the full local path with the 'file' scheme, e.g. file://{YOUR_FILE_PATH}.

    I believe this happens because 'fs.defaultFS' in the cluster's Hadoop config is 'hdfs', so a bare path like input.txt is resolved against HDFS (hdfs://XXX/user/hadoop/input.txt), not the local filesystem.
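    Both options sketched as commands (a sketch only; /home/hadoop/input.txt is an assumed location for where you downloaded the file on the master node, so adjust the paths to your setup):

    ```shell
    # Option 1: copy the local file into HDFS, where the bare path resolves
    hdfs dfs -put /home/hadoop/input.txt /user/hadoop/input.txt
    spark-submit --class "wordCount" --master local[4] Spark05-1.1.jar input.txt

    # Option 2: keep the file local and give Spark an explicit file:// URI
    spark-submit --class "wordCount" --master local[4] Spark05-1.1.jar \
        file:///home/hadoop/input.txt
    ```

    Note that saveAsTextFile will resolve "result.txt" the same way, so with option 1 the output lands in HDFS under /user/hadoop/result.txt.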