Tags: apache-spark, apache-spark-sql, spark-hive

Spark CSV IOException Mkdirs failed to create file


TL;DR

Spark 1.6.1 fails to write a CSV file with Spark CSV 1.4 on a standalone cluster without HDFS, throwing java.io.IOException: Mkdirs failed to create file.

More details:

I'm working on a Spark 1.6.1 application in Scala, running on a standalone cluster that uses the local filesystem (the machine I'm running on doesn't even have HDFS installed). I have a DataFrame that I'm trying to save as a CSV file using HiveContext.

This is what I'm running:

exportData.write
      .mode(SaveMode.Overwrite)
      .format("com.databricks.spark.csv")
      .option("delimiter", ",")
      .save("/some/path/here") // no hdfs:/ or file:/ prefix in the path

The Spark CSV version I'm using is 1.4. When I run this code, I get the following exception:

WARN  TaskSetManager:70 - Lost task 4.3 in stage 10.0: java.io.IOException: Mkdirs failed to create file: /some/path/here/_temporary/0

The full stacktrace is:

at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
        at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
        at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

The output dir does get created, but it's empty.
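
Since the failure originates in ChecksumFileSystem (Hadoop's local-filesystem wrapper), one thing worth checking is which FileSystem implementation Spark actually resolves for the output path. This is a minimal sketch, assuming a spark-shell session where sc is available and reusing the path from above:

import org.apache.hadoop.fs.Path

// Resolve the output path against Spark's Hadoop configuration to see which
// FileSystem implementation (LocalFileSystem, DistributedFileSystem, ...)
// the tasks will use when creating /some/path/here/_temporary/0
val outputPath = new Path("/some/path/here") // path from the question
val fs = outputPath.getFileSystem(sc.hadoopConfiguration)

println(fs.getClass.getName)    // e.g. org.apache.hadoop.fs.LocalFileSystem
println(fs.getWorkingDirectory) // directory relative paths would resolve against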

I tried running it from the spark shell: I created a dummy DataFrame and then saved it using the exact same code (and to the same path). It succeeded.
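
For reference, the shell test looked roughly like this (a minimal sketch; the dummy column names and values are made up):

import org.apache.spark.sql.SaveMode

// Build a small dummy DataFrame in the spark-shell (sqlContext is predefined there)
val dummy = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "value")

// Save it with the exact same writer code and the same target path as the failing job
dummy.write
  .mode(SaveMode.Overwrite)
  .format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .save("/some/path/here")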

I checked the permissions on the folder I'm writing to and changed them to 777, but it still doesn't work when running the Spark job.

Googling it suggested:

  • changing the path prefix by removing hdfs:/, which I don't use anyway; I also tried adding a file:/, file://, and file:/// prefix, with no luck (see the sketch after this list)
  • permissions issues - I tried solving this by making the folder 777
  • some MacBook issue which is probably not relevant to me since I'm working on Ubuntu
  • security issues - examining my stacktrace, I couldn't find any security failure.
  • removing the / prefix at the beginning of my file path - I tried it as well with no luck
  • other unanswered questions regarding this problem
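
For clarity, these are the path variants I tried against the same writer (a sketch; exportData is the DataFrame from the question, and each commented line stands in for a separate run):

// Variants of the output path, all of which gave the same Mkdirs failure
exportData.write
  .mode(SaveMode.Overwrite)
  .format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .save("/some/path/here")          // absolute local path, no scheme
//.save("file:/some/path/here")     // file scheme, single slash
//.save("file://some/path/here")    // file scheme, double slash
//.save("file:///some/path/here")   // file scheme, triple slash
//.save("some/path/here")           // relative path, no leading /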

Does anyone have any idea what exactly the problem is, and how to overcome it?

Thanks in advance


Solution

Ok, so I found the problem, and I hope this will help others.

Apparently the machine I'm running on has Hadoop installed on it. When I ran hadoop version it output Hadoop 2.6.0-cdh5.7.1, which conflicts with my Spark version (see the version-check sketch at the end).

Also, I'm not quite sure whether it's related, but I was running Spark as root instead of as the Spark user, which may have caused some permission issues.

After matching the Hadoop version to our Spark (in our case, we matched Spark to Cloudera's Spark build) and running the code as the Spark user, this failure stopped happening.
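
As a quick way to spot this kind of mismatch from the spark-shell, you can compare the Hadoop version on Spark's classpath with the one installed on the machine (a minimal sketch; org.apache.hadoop.util.VersionInfo ships with Hadoop):

// Hadoop version bundled on Spark's classpath; compare with `hadoop version` on the CLI
println(org.apache.hadoop.util.VersionInfo.getVersion)

// Spark version, for completeness
println(sc.version)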