Tags: maven, apache-spark, hadoop, junit, mapreduce

Spark test cases not working for version 2.4.0


We have an application that uses Hadoop 3.1.3 and Spark 2.4.0. Both the Map-Reduce derivations and the Spark dataset derivations are written in Java. The Map-Reduce test cases worked fine, but the Spark JUnit test cases failed. The Spark session was created properly, but the failure occurred while loading a sample JSON file with it:

        // Load a multiline JSON sample from the test resources.
        Dataset<Row> input = sparkSession.read()
            .format("json")
            .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
            .option("quote", "\"")
            .option("multiline", true)
            .load("src/test/resources/samples/abcd.json");

the following error occurred:

Exception in thread "dag-scheduler-event-loop" java.lang.NoSuchMethodError: org.apache.hadoop.mapreduce.InputSplit.getLocationInfo()[Lorg/apache/hadoop/mapred/SplitLocationInfo;
at org.apache.spark.rdd.NewHadoopRDD.getPreferredLocations(NewHadoopRDD.scala:310)
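
For context, the failing test boils down to something like the following self-contained JUnit 4 sketch (the class name, method names, and session settings are illustrative, not the actual test code):

        import org.apache.spark.sql.Dataset;
        import org.apache.spark.sql.Row;
        import org.apache.spark.sql.SparkSession;
        import org.junit.AfterClass;
        import org.junit.BeforeClass;
        import org.junit.Test;

        public class SparkJsonLoadTest {

            private static SparkSession sparkSession;

            @BeforeClass
            public static void startSpark() {
                // local[*] runs Spark inside the JUnit JVM, so the Hadoop jars
                // on the test classpath are exactly the ones Spark calls into.
                sparkSession = SparkSession.builder()
                        .appName("spark-json-load-test")
                        .master("local[*]")
                        .getOrCreate();
            }

            @Test
            public void loadsSampleJson() {
                Dataset<Row> input = sparkSession.read()
                        .format("json")
                        .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
                        .option("quote", "\"")
                        .option("multiline", true)
                        .load("src/test/resources/samples/abcd.json");
                // Schema inference on a multiline JSON file already runs a
                // Spark job, so the NoSuchMethodError surfaces during load().
                input.show();
            }

            @AfterClass
            public static void stopSpark() {
                sparkSession.stop();
            }
        }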

Environment variables are set as:

    HADOOP_HOME=c:\hadoop-2.7.1
    JAVA_HOME=c:\openjdk1.8.0_271
    SPARK_HOME=c:\spark-2.4.0-bin-hadoop2.7

(SPARK_HOME points at the Hadoop 2.7 build, since Spark 2.4 uses Hadoop 2.7.1.)

Which jar should I exclude from Spark, or which Hadoop jar should I include? Is there any other step I should check?


Solution

  • I found a fix for the above problem. A NoSuchMethodError like this means the Hadoop InputSplit class loaded at runtime comes from a different version than the one Spark was compiled against, so you have to introduce the following dependency:

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>3.1.1.7.1.6.0-297</version>
        </dependency>
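
Declaring the dependency explicitly works because a direct dependency always wins over whatever older MapReduce client jar was reaching the test classpath transitively. In a multi-module build, a dependencyManagement entry is a more robust way to force one consistent version everywhere; the sketch below reuses the same cluster-specific version string as above, so adjust it to whatever your repository provides:

        <dependencyManagement>
            <dependencies>
                <!-- Pin the MapReduce client once, so every module and every
                     transitive pull (including Spark's) resolves to the same
                     version as the rest of the Hadoop jars. -->
                <dependency>
                    <groupId>org.apache.hadoop</groupId>
                    <artifactId>hadoop-mapreduce-client-core</artifactId>
                    <version>3.1.1.7.1.6.0-297</version>
                </dependency>
            </dependencies>
        </dependencyManagement>

You can confirm which versions actually land on the test classpath with mvn dependency:tree -Dincludes=org.apache.hadoop.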