I am trying to create a very simple Zeppelin notebook that reads a CSV file and does analysis on it. However, I am running into a very weird error: despite the file being shown by the ls command, when I try to read it with read.csv, I get a java.io.FileNotFoundException.
There is a good chance that, by default, your Zeppelin notebook (and the underlying Spark stack) is configured to resolve relative file paths against HDFS.
Therefore, you probably need to use an absolute path with an explicit scheme stating that the file lives on the local file system:
data = spark.read.csv("file:///data/your_path/banks.csv")
If your notebook connects to a Spark cluster, accessing the local filesystem is not a good idea anyway: you would have to manually deploy the file to every node in the cluster and keep the copies in sync. That is exactly what HDFS is made for, so your best bet is to take advantage of it. Put your file somewhere in your HDFS storage, then load it from Spark over HDFS.
In your shell :
hdfs dfs -put /file_system_path/banks.csv "/user/zeppelin/banks.csv"
Please note that the actual path where your HDFS files can be put will vary based on your cluster installation.
Then Spark should be able to load it :
spark.read.csv("/user/zeppelin/banks.csv")
Of course, there are other ways than HDFS to do this. Spark can connect to S3, for instance, and if that suits you better than HDFS, it is another possibility: read("s3a://...").
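For the S3 route, a minimal sketch of the configuration, assuming the hadoop-aws connector is on the classpath; the bucket name and credentials below are placeholders, and the property keys are the standard s3a ones:

```
# Hypothetical spark-defaults.conf entries for s3a access
spark.hadoop.fs.s3a.access.key  YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key  YOUR_SECRET_KEY
```

With that in place, spark.read.csv("s3a://your-bucket/banks.csv") should work the same way as the HDFS read above.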