Is it possible to have Spark take a local file as input, but process it in a distributed way?
I have sc.textFile("file:///path-to-file-locally")
in my code, and I know that the exact path to the file is correct. Yet I am still getting:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 14, spark-slave11.ydcloud.net): java.io.FileNotFoundException: File file:/<path to file> does not exist
I am running Spark distributed, not locally. Why does this error occur?
It is possible, but when you declare a local path as the input it has to be present on each worker machine and on the driver. That means you have to distribute the file first, either manually (for example by copying it to the same path on every node, after which your sc.textFile call works as written) or with built-in tools like SparkFiles.
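A minimal sketch of the SparkFiles route (the path and the name data.txt are placeholders, not from your setup): sc.addFile ships the file to every executor, and SparkFiles.get, called inside a task, resolves that executor's local copy. The resolved path differs from node to node, which is why the file is opened inside the task instead of being passed to textFile:

from pyspark import SparkFiles

sc.addFile("file:///absolute/path/on/driver/data.txt")  # shipped to every executor

def tag_partition(records):
    # Resolve this executor's local copy of the file at task run time.
    with open(SparkFiles.get("data.txt")) as f:
        keys = {line.strip() for line in f}
    for x in records:
        yield (x, x in keys)

result = sc.parallelize(["a", "b", "c"]).mapPartitions(tag_partition).collect()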
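Alternatively, if the file fits in driver memory, a common workaround is to read it once on the driver and let Spark distribute the records itself (again a sketch, with a placeholder path):

with open("/path/to/file/on/driver") as f:   # plain driver-side read
    lines = f.read().splitlines()

rdd = sc.parallelize(lines)   # partitions now live on the executors
print(rdd.map(len).sum())     # any action from here on runs distributed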