We have an application using Hadoop 3.1.3 and Spark 2.4.0. The Map-Reduce derivations are written in Java, as are the Spark dataset derivations. The Map-Reduce test cases work fine, but the Spark JUnit test cases fail. The Spark session is created properly, but when loading a sample JSON file through it:
Dataset<Row> input = sparkSession.read()
        .format("json")
        .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
        .option("quote", "\"")
        .option("multiline", true)
        .load("src/test/resources/samples/abcd.json");
the following error occurred:
Exception in thread "dag-scheduler-event-loop" java.lang.NoSuchMethodError: org.apache.hadoop.mapreduce.InputSplit.getLocationInfo()[Lorg/apache/hadoop/mapred/SplitLocationInfo;
at org.apache.spark.rdd.NewHadoopRDD.getPreferredLocations(NewHadoopRDD.scala:310)
Environment variables are set as follows (Spark 2.4 is built against Hadoop 2.7.1):
HADOOP_HOME=c:\hadoop-2.7.1
JAVA_HOME=c:\openjdk1.8.0_271
SPARK_HOME=c:\spark-2.4.0-bin-hadoop2.7
Which jar should I exclude from Spark, or which Hadoop jar should I include? Or is there any other step to check?
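One generic way to check this kind of NoSuchMethodError is to ask the JVM which jar actually supplied the class, on the same classpath the tests run with. A minimal sketch (the class name `WhichJar` and its default argument are illustrative, not from the original post):

```java
import java.security.CodeSource;

// Hedged sketch: prints which jar a class was loaded from, to spot mixed
// Hadoop versions on the test classpath. Pass the fully qualified class
// name to inspect as the first argument, e.g.
// org.apache.hadoop.mapreduce.InputSplit.
public class WhichJar {

    // Returns the jar/path that provides the class, or a marker when the
    // class comes from the JVM bootstrap loader or is not on the classpath.
    static String locate(String fqcn) {
        try {
            CodeSource src = Class.forName(fqcn)
                    .getProtectionDomain().getCodeSource();
            return src != null ? src.getLocation().toString()
                               : "bootstrap/unknown";
        } catch (ClassNotFoundException e) {
            return "not on classpath";
        }
    }

    public static void main(String[] args) {
        // Default to a JDK class purely for demonstration.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " loaded from: " + locate(name));
    }
}
```

If the printed location for `org.apache.hadoop.mapreduce.InputSplit` is a Hadoop 2.x jar while the application code compiles against 3.1.3, that mismatch would explain the error.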
I found a fix for the above problem: introduce the following dependency:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>3.1.1.7.1.6.0-297</version>
</dependency>
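A related step worth checking (a sketch, not verified against this exact build): run `mvn dependency:tree` to see which Hadoop version each Spark artifact pulls in transitively, and exclude the conflicting Hadoop 2.x client from the Spark dependency so the project's Hadoop 3.1.3 jars win. The coordinates below are illustrative, assuming spark-core built for Scala 2.11:

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.0</version>
    <exclusions>
        <!-- keep Spark's transitive Hadoop 2.x jars off the test classpath -->
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```

Either way, the goal is a single Hadoop version on the classpath rather than Spark's bundled 2.7 jars mixed with the application's 3.1.3 jars.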