In PySpark, reading CSV files fails if even one path does not exist. How can we avoid this?


In PySpark, reading CSV files from multiple paths fails if even one of the paths does not exist.

Logs = spark.read.load(Logpaths, format="csv", schema=logsSchema, header="true", mode="DROPMALFORMED")

Here, Logpaths is an array containing multiple paths, created dynamically from a given startDate and endDate range. If Logpaths contains 5 paths and the first 3 exist but the 4th does not, the whole extraction fails. How can I avoid this in PySpark, or how can I check the paths' existence before reading?
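For illustration, a minimal sketch of how such a path list might be built; the base directory and date format are borrowed from the Scala snippet below, and the concrete date range is an assumption:

from datetime import date, timedelta

# Hypothetical reconstruction of the dynamic path list described above.
# The base directory and date format are assumed from the Scala example below.
startDate = date(2018, 12, 14)
endDate = date(2018, 12, 18)

Logpaths = []
d = startDate
while d <= endDate:
    Logpaths.append("/bilal/{}/logs.csv".format(d.strftime("%Y.%m.%d")))
    d += timedelta(days=1)

# If any of these daily paths was never written, spark.read.load(Logpaths, ...) fails.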

In Scala I did this by checking file existence and filtering out the non-existent paths using the Hadoop FileSystem globStatus function.

val path = "/bilal/2018.12.16/logs.csv"
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val fs = org.apache.hadoop.fs.FileSystem.get(hadoopConf)
val fileStatus = fs.globStatus(new org.apache.hadoop.fs.Path(path))
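The same globStatus call can also be reached from PySpark through the JVM gateway. A sketch, assuming an existing SparkContext named sc:

# Calling Hadoop's globStatus from PySpark via the Py4J gateway.
# Assumes an existing SparkContext `sc`.
jvm = sc._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
status = fs.globStatus(jvm.org.apache.hadoop.fs.Path("/bilal/2018.12.16/logs.csv"))

# globStatus returns null (None in Python) for a missing non-wildcard path
# and an empty array for a wildcard pattern with no matches.
exists = status is not None and len(status) > 0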

Solution

  • So I found what I was looking for. Like the code I posted in the question, which can be used in Scala for the file-existence check, the code below can be used in PySpark.

    # Obtain the Hadoop FileSystem through Spark's JVM gateway
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    # exists returns True if the given path exists, False otherwise
    fs.exists(sc._jvm.org.apache.hadoop.fs.Path("bilal/logs/log.csv"))
    

    This is essentially the same code used in Scala: we are accessing the Hadoop Java library through Py4J, and the Java code runs on the JVM that Spark itself runs on. A sketch that applies this check to the whole path list from the question follows below.
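    Building on this, a minimal sketch that filters the path list down to the paths that actually exist before reading; it assumes Logpaths and logsSchema are defined as in the question:

    # Keep only the paths that exist, then read the rest as before.
    # Assumes Logpaths and logsSchema from the question are already defined.
    hadoopPath = sc._jvm.org.apache.hadoop.fs.Path
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())

    existingPaths = [p for p in Logpaths if fs.exists(hadoopPath(p))]

    Logs = spark.read.load(existingPaths, format="csv", schema=logsSchema,
                           header="true", mode="DROPMALFORMED")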