scala, apache-spark, configuration, hadoop-yarn, hadoop2

Spark yarn-cluster mode HDFS I/O file path configuration


I've tried to run the basic Spark word count example below on the name-node server in pseudo-distributed mode (Hadoop 2.6.0):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]){

    //args(0): input file name, args(1): output dir name
    //e.g. hello.txt hello
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    val input = sc.textFile(args(0))
    val words = input.flatMap(_.split(" "))
    val counts = words.map((_, 1)).reduceByKey(_ + _)

    counts.saveAsTextFile(args(1))
  }
}

with a start.sh file like this:

$SPARK_HOME/bin/spark-submit \
--master yarn-cluster \
--class com.gmail.hancury.hdfsio.WordCount \
./target/scala-2.10/sparktest_2.10-1.0.jar hello.txt server_hello

When I write the input file path as

hdfs://master:port/path/to/input/hello.txt or
hdfs:/master:port/path/to/input/hello.txt or
/path/to/input/hello.txt

a mysterious additional path is attached automatically:

/user/${user.name}/input/

So if I write a path like /user/curycu/input/hello.txt, the path that is actually applied becomes /user/curycu/input/user/curycu/input/hello.txt,

and a FileNotFoundException is thrown.

I want to know where on earth that magical path comes from...

I've checked core-site.xml, yarn-site.xml, hdfs-site.xml, mapred-site.xml, spark-env.sh, and spark-defaults.conf on the name-node server, but there is no clue about /user/${user.name}/input.
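
For reference, HDFS resolves relative paths against the client's working directory, which defaults to /user/<username>; that is the usual source of a /user/${user.name}/... prefix. Below is a minimal Scala sketch that prints how a given path gets qualified on the cluster (the PathCheck object name and the sample paths are only illustrative; it assumes the standard Hadoop FileSystem API and the same core-site.xml / hdfs-site.xml the Spark driver sees):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object PathCheck {
  def main(args: Array[String]): Unit = {
    // Picks up core-site.xml / hdfs-site.xml from the classpath,
    // so run it with the same Hadoop configuration as the Spark driver.
    val conf = new Configuration()
    val fs = FileSystem.get(conf)

    // HDFS resolves relative paths against this directory (/user/<username> by default).
    println(s"working directory: ${fs.getWorkingDirectory}")

    // Show the fully qualified form of a relative and an absolute path.
    Seq("hello.txt", "/user/curycu/input/hello.txt").foreach { p =>
      val qualified = new Path(p).makeQualified(fs.getUri, fs.getWorkingDirectory)
      println(s"$p -> $qualified")
    }
  }
}

An absolute path should come back unchanged apart from the added scheme and authority, so a doubled prefix like the one above suggests something other than normal client-side path resolution is going on.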


Solution

  • All of the above errors occur when you don't use an assembly jar (uber jar).

    Don't build with sbt package;
    use sbt assembly instead (a minimal build sketch follows).
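
A minimal sketch of the sbt-assembly setup this implies (the plugin and Spark versions shown are only examples; match them to the project's actual sbt and Spark versions):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

// build.sbt
name := "sparktest"

version := "1.0"

scalaVersion := "2.10.5"

// Spark itself is provided by the YARN cluster, so keep it out of the uber jar.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"

// Discard duplicate META-INF entries that would otherwise make the merge fail.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}

With sbt-assembly defaults, the uber jar lands in target/scala-2.10/ with an -assembly suffix (e.g. sparktest-assembly-1.0.jar), so the jar path in start.sh needs to point at that file instead of the sbt package output.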