
CentOS | Apache Spark error "path already exists" when writing Parquet


I am unable to write to a file that I create myself. On Windows it works fine. On CentOS, Spark reports that the file already exists and writes nothing.

File tempFile = new File("temp/tempfile.parquet");
tempFile.createNewFile();
parquetDataSet.write().parquet(tempFile.getAbsolutePath());

The error is as follows:

2020-02-29 07:01:18.007 ERROR 1 --- [nio-8090-exec-1] c.gehc.odp.util.JsonToParquetConverter   : Stack Trace: {}org.apache.spark.sql.AnalysisException: path file:/temp/myfile.parquet already exists.;
2020-02-29 07:01:18.007 ERROR 1 --- [nio-8090-exec-1] c.gehc.odp.util.JsonToParquetConverter   : sparkcontext close

Solution

  • Spark's default save mode is ErrorIfExists: if the path you are writing to already exists, the write fails with an exception like the one above. That is what happens here, because you create the file yourself with createNewFile() instead of leaving that to Spark. There are two ways to resolve this:

    1) Set the save mode to "overwrite" or "append" in the write command:

    parquetDataSet.write().mode("overwrite").parquet(tempFile.getAbsolutePath());
    

    2) Or remove the createNewFile() call and pass the destination path straight to the Spark write command:

    parquetDataSet.write().parquet("temp/tempfile.parquet");
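
The cause described above can be reproduced without Spark. The following is only a minimal sketch in plain Java: the class SaveModeSketch and the helper writeErrorIfExists are hypothetical names that imitate an "error if exists" check, and they are not part of Spark's API. It shows that a path pre-created by the caller is rejected, while a path that does not exist yet is written normally.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Stand-in for an "error if exists" save mode; this is NOT Spark code.
class SaveModeSketch {

    // Hypothetical helper: refuses to write when the target path already exists.
    static void writeErrorIfExists(Path target, String data) throws IOException {
        if (Files.exists(target)) {
            throw new FileAlreadyExistsException(target.toString());
        }
        Files.write(target, data.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("savemode-sketch");

        // Case 1: pre-create the file, as the question does with createNewFile().
        Path preCreated = dir.resolve("tempfile.parquet");
        Files.createFile(preCreated);
        try {
            writeErrorIfExists(preCreated, "rows");
        } catch (FileAlreadyExistsException e) {
            System.out.println("rejected, path already exists: " + preCreated);
        }

        // Case 2: hand over a path that does not exist yet; the writer creates it.
        Path fresh = dir.resolve("fresh.parquet");
        writeErrorIfExists(fresh, "rows");
        System.out.println("written: " + Files.exists(fresh));
    }
}
```

In Spark's Java API the save mode can also be passed as the enum org.apache.spark.sql.SaveMode, for example mode(SaveMode.Overwrite), instead of the string "overwrite".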