apache-spark apache-zeppelin

Apache Zeppelin gives java.io.FileNotFoundException despite file being present in the location


I am trying to create a very simple Zeppelin notebook that reads a CSV file and analyzes it. However, I am running into a strange error: although the file shows up in the output of ls, reading it with read.csv throws java.io.FileNotFoundException.

The ls command shows the bank.csv file (4th from top). [screenshot]

But I get an exception when trying to read the file. [screenshot]


Solution

  • In a local / standalone Zeppelin installation

    There is a good chance that, by default, your Zeppelin notebook (and the underlying Spark stack) is configured to look in HDFS for relative file paths.

    Therefore, you probably need to use an absolute path with the file:// scheme, which tells Spark to read from the local file system.

    data = spark.read.csv("file:///data/your_path/banks.csv")
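
As a minimal sketch of the idea above, the following builds a file:// URI from a local path using only the Python standard library. The helper name `to_file_uri` is hypothetical, and the sketch assumes a PySpark paragraph in which Zeppelin has already defined `spark`; the Spark call is left as a comment so the snippet runs on its own.

```python
import os
from pathlib import Path


def to_file_uri(path):
    """Return an absolute file:// URI for a local path (hypothetical helper)."""
    # abspath makes a relative path absolute without touching the file system
    return Path(os.path.abspath(path)).as_uri()


uri = to_file_uri("/data/your_path/banks.csv")
print(uri)

# Assumption: inside a Zeppelin PySpark paragraph, where `spark` already exists:
# data = spark.read.csv(uri)
```

Keep in mind that a file:// path is resolved on whichever machine performs the read, so this only works as-is when the driver and the executors see the same local file system, as in a standalone installation.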
    

  • In a cluster Zeppelin installation

    If your notebook connects to Spark installed on a cluster, accessing the local file system is not a good idea: you would have to manually deploy the files to every node in the cluster and keep them in sync. That is what HDFS is made for.

    So your best bet would be to take advantage of it. Put your file somewhere in your HDFS storage, then load it from Spark over HDFS.

    In your shell:

    hdfs dfs -put /file_system_path/banks.csv "/user/zeppelin/banks.csv"
    

    Please note that the actual path where your HDFS files can be put will vary based on your cluster installation.

    Then Spark should be able to load it:

    spark.read.csv("/user/zeppelin/banks.csv")
    

    Of course, there are other ways than HDFS to do this. Spark can connect to S3, for instance, and if that suits you better than HDFS, it is a possibility too (read("s3a://...")).
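
The choice between these options comes down to the URI scheme at the front of the path. As a rough sketch, the hypothetical helper below (`storage_for`, not part of Spark) uses the standard library to report which storage a path string points at. A path with no scheme is resolved by Hadoop against the configured default file system (`fs.defaultFS`), which is why a bare path can end up being looked up in HDFS. The bucket name is a placeholder.

```python
from urllib.parse import urlparse

# Labels are illustrative only; Spark and Hadoop do the real resolution.
SCHEMES = {
    "file": "local file system",
    "hdfs": "HDFS",
    "s3a": "Amazon S3 (via the s3a connector)",
}


def storage_for(path):
    """Describe where a Spark path string points, based on its URI scheme."""
    scheme = urlparse(path).scheme
    if not scheme:
        # No scheme: Hadoop falls back to the default file system
        return "default file system (fs.defaultFS)"
    return SCHEMES.get(scheme, "other scheme: " + scheme)


for p in [
    "/user/zeppelin/banks.csv",
    "file:///data/your_path/banks.csv",
    "s3a://example-bucket/banks.csv",  # placeholder bucket name
]:
    print(p, "->", storage_for(p))
```

Spelling out the scheme in every path passed to Spark removes the dependence on how the default file system happens to be configured.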