Is it possible to have Spark take a local file as input, but process it in a distributed way?
I have sc.textFile("file:///path-to-file-locally")
in my code, and I know that the exact path to the file is correct. Yet I am still getting:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 14, spark-slave11.ydcloud.net): java.io.FileNotFoundException: File file:/<path to file> does not exist
I am running Spark distributed, not locally. Why does this error occur?
It is possible, but when you declare a local path as the input it has to be present on each worker machine and on the driver. That means you have to distribute the file first, either manually (for example by copying it to the same path on every node, after which your sc.textFile call works as written) or with built-in tools like SparkFiles.
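A minimal sketch of the SparkFiles route (the path and the name data.txt are placeholders, not from your setup): sc.addFile ships the file to every executor, and SparkFiles.get, called inside a task, resolves that executor's local copy. The resolved path differs from node to node, which is why the file is opened inside the task instead of being passed to textFile:

from pyspark import SparkFiles

sc.addFile("file:///absolute/path/on/driver/data.txt")  # shipped to every executor

def tag_partition(records):
    # Resolve this executor's local copy of the file at task run time.
    with open(SparkFiles.get("data.txt")) as f:
        keys = {line.strip() for line in f}
    for x in records:
        yield (x, x in keys)

result = sc.parallelize(["a", "b", "c"]).mapPartitions(tag_partition).collect()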
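Alternatively, if the file fits in driver memory, a common workaround is to read it once on the driver and let Spark distribute the records itself (again a sketch, with a placeholder path):

with open("/path/to/file/on/driver") as f:   # plain driver-side read
    lines = f.read().splitlines()

rdd = sc.parallelize(lines)   # partitions now live on the executors
print(rdd.map(len).sum())     # any action from here on runs distributed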