Tags: java, apache-spark, bigdata, parquet

User class threw exception: org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually


I'm reading Parquet input in a Spark Java job like this: Dataset<Row> input = spark.read().parquet(configuration.getInputDataLocation());

But the inputDataLocation (a folder in an Azure Storage Account container) may not contain any data, and in that case an exception is thrown: User class threw exception: org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.

Is there a concise way to check beforehand whether the folder is empty, and only run the line above when it is not?


Solution

  • Why don't you list the input dir first to check that it exists and is not empty?

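        import java.io.IOException;
        import java.io.UncheckedIOException;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        // Inside the method that loads the input:
        final Path dir = new Path(configuration.getInputDataLocation());
        final boolean hasData;
        try {
            final FileSystem fs = dir.getFileSystem(spark.sparkContext().hadoopConfiguration());
            // The input location is a folder, so isFile() would always be false here.
            // Check that the folder exists and lists at least one entry (0 length is an empty dir).
            hasData = fs.exists(dir) && fs.listStatus(dir).length > 0;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        if (hasData) {
            return spark.read().parquet(configuration.getInputDataLocation());
        } else {
            LOG.warn("Input directory '{}' does not exist or is empty", dir);
            // SOME_ENCODER is a placeholder for the encoder of your row type
            return spark.emptyDataset(SOME_ENCODER);
        }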
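
One caveat with the `listStatus(dir).length` check: a folder that holds only a marker file such as `_SUCCESS` counts as non-empty, yet as far as I know Spark skips files whose names start with `_` or `.` when it looks for data, so the same "Unable to infer schema" error can still appear. Below is a minimal sketch of filtering the listed names before deciding. `ParquetInputCheck`, `isDataFile` and `hasDataFiles` are hypothetical names, not part of Spark or Hadoop, and the naming rule is an assumption. In a real job the names would come from `FileStatus.getPath().getName()` for each entry returned by `listStatus(dir)`.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; the class and method names are illustrative only. */
final class ParquetInputCheck {

    /** Returns true when the name looks like a data file rather than a marker or hidden file. */
    static boolean isDataFile(String fileName) {
        // Assumption: names starting with '_' (e.g. _SUCCESS) or '.' (e.g. checksum files) carry no rows.
        return !fileName.isEmpty() && !fileName.startsWith("_") && !fileName.startsWith(".");
    }

    /** Returns true when at least one listed name looks like a data file. */
    static boolean hasDataFiles(List<String> fileNames) {
        return fileNames.stream().anyMatch(ParquetInputCheck::isDataFile);
    }

    public static void main(String[] args) {
        // Stand-ins for the names found in the input folder.
        List<String> onlyMarkers = Arrays.asList("_SUCCESS", ".part-00000.snappy.parquet.crc");
        List<String> withData = Arrays.asList("_SUCCESS", "part-00000.snappy.parquet");

        System.out.println(hasDataFiles(onlyMarkers)); // prints false
        System.out.println(hasDataFiles(withData));    // prints true
    }
}
```

This only looks at names in one flat listing. If the input is partitioned into subfolders, the listing would have to be walked recursively before applying the same filter.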