Tags: apache-spark, parquet

How to properly read a folder that supposedly contains Parquet files from Spark when the folder is empty


When I try to read a folder that supposedly contains files in Parquet format, everything works if there is data. If there is no data, the first line throws an error and the subsequent code doesn't execute:

val hdfsData: DataFrame = spark.sqlContext.read.parquet(hdfsPath)
hdfsData.rdd.isEmpty() match ....
....

Error: org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;

What's the right way to handle this scenario?


Solution

  • I ran into the same issue and handled it with a simple Try/Success/Failure:

    import scala.util.{Try, Success, Failure}

    val acc: DataFrame = session.createDataset(List("foo", "bar")).toDF()

    val tryDf: Try[DataFrame] =
      Try(
        session.read.parquet("s3://path-to-bucket/path-to-folder-with-no-parquet-files-under-it/")
      )

    val resultDf: DataFrame = tryDf match {
      case Success(df) => acc.union(df)
      case Failure(f) =>
        println(s"@@ handled ${ f }") // => @@ handled org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
        acc
    }

    println(s"resultDf.count ${ resultDf.count }") // => 2
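  • Another option, when the path is one you can inspect, is to test for part files before calling `read.parquet` at all, so no exception is raised in the first place. A minimal sketch using a hypothetical helper: this version covers only local paths via `java.io`; for HDFS or S3 the analogous check would go through Hadoop's `org.apache.hadoop.fs.FileSystem` API (e.g. `globStatus` on a `dir/*.parquet` glob).

    ```scala
    import java.io.File

    // Hypothetical helper (local filesystem only): returns true when `dir`
    // exists and contains at least one file ending in .parquet.
    // For HDFS/S3, use org.apache.hadoop.fs.FileSystem.globStatus instead.
    def hasParquetFiles(dir: String): Boolean = {
      val d = new File(dir)
      // listFiles returns null for non-directories, hence the Option wrapper
      d.isDirectory && Option(d.listFiles).exists(_.exists(_.getName.endsWith(".parquet")))
    }
    ```

    With such a check in place you can fall back to an accumulator like `acc` above (or `session.emptyDataFrame`) without relying on exception handling for control flow.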