
Connect from Spark-JobServer (local instance) to Hadoop


I run a virtual machine with local instances of Hadoop and Spark-JobServer on it. I created a file named 'test.txt' on HDFS that I want to open from Spark-JobServer. I wrote the following code to do this:

val test1 = sc.textFile("hdfs://quickstart.cloudera:8020/test.txt")
val test2 = test1.count
test2

However, when I run these lines, I get an error in Spark-JobServer:

"Input path does not exist: hdfs://quickstart.cloudera:8020/test.txt"

I looked up the path to HDFS with hdfs getconf -confKey fs.defaultFS and it showed me hdfs://quickstart.cloudera:8020 as the path. Why can I not access the test.txt file if this is the correct path to HDFS? If this is the incorrect path, how can I find the correct path?


Solution

  • Your file is not in the root directory.

    You will find your file under hdfs:///user/<your username>/test.txt

    When you do a hadoop fs -put without specifying a destination, the file goes into your user's home directory on HDFS, not into the root directory.

    Check the output of the following commands to verify this:

    hadoop fs -cat test.txt 
    hadoop fs -cat /test.txt
    

    Then run hadoop fs -put test.txt /

    and see if your Spark code works. Alternatively, point the code at the file's actual location under your home directory, as in the sketch below.
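
    If you keep the file where it is, the job can read it from the home-directory path instead. This is a minimal sketch, assuming the Cloudera quickstart VM's default cloudera user; substitute your own HDFS home directory:

    // Minimal sketch: read test.txt from the HDFS user home directory.
    // Assumes the username is 'cloudera' (the quickstart VM default) -- adjust to yours.
    val test1 = sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/test.txt")
    val test2 = test1.count
    test2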