After a recent upgrade to HDP 3.1 (now using Spark 2.3.x instead of 2.2.x), a query like:
spark.sql("SELECT * from mydb.mytable").filter('partition_date between "202001010000" and "202001020000").write.parquet("foo.out")
sometimes fails when reading from an HDFS-backed Hive table (no object storage). Note that the underlying data (an EXTERNAL table in Hive) has a retention period, and any data older than that cutoff is deleted. The deletion job runs every 5 minutes, so it can occur while the above query is executing.
Even though:
PartitionFilters: [isnotnull(partition_date#3099), (partition_date#3099 >= 202001010000), (partition_date#3099 <= 202001020000)]
partition filtering (predicate pushdown) appears to be enabled, more partitions than desired are read during the initial path traversal. Since the upgrade to 2.3, Spark shows the progress of listing file directories in the UI. Interestingly, we always get two entries: one for the lower of the two boundaries of interest, and one for the oldest directory still available:
Listing leaf files and directories for 380 paths:
/path/to/files/on/hdfs/mydb.db/mytable/partition_date=202001010000/sub_part=0, ...
Listing leaf files and directories for 7100 paths:
/path/to/files/on/hdfs/mydb.db/mytable/partition_date=201912301300/sub_part=0, ...
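For reference, the PartitionFilters quoted above come from the physical plan; here is a minimal sketch of how to reproduce that output, reusing the table, column, and bounds from the query (whether the scan shows up as a FileScan node with PartitionFilters depends on the table's storage format and the convertMetastore* settings):

// Sketch only: inspect the physical plan to confirm the pushed partition filters.
val df = spark.sql("SELECT * FROM mydb.mytable")
  .filter("partition_date BETWEEN '202001010000' AND '202001020000'")
df.explain(true)  // the FileScan node should show PartitionFilters: [...]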
Note the resulting error message:
File does not exist: /path/to/files/on/hdfs/mydb.db/mytable/partition_date=201912301300/sub_part=0/file_name_unique_hash_timestamp.par
How can I force Spark to list only the directories in the desired interval, and not ones outside it that may collide with the maximum data retention period?
It looks like this is related:
@meniluca is correct in the sense that there must be a mismatch between what HDFS actually has available and what the Hive metastore reports as available.
However, instead of using views, which seem a bit opaque/hard to reason about (in terms of which file paths end up included in the read operation), I prefer:
spark.read
  .option("basePath", "/path/to/mydb.db/mytable")
  .orc(
    "/path/to/mydb.db/mytable/partition_date=202001[1-2]*/*",
    "/path/to/mydb.db/mytable/partition_date=202001[3-4]*/*")
This forces Spark to list only the right (desired) paths.
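If hard-coding the globs is undesirable, the same idea can be expressed by building the path list explicitly before the read. A rough sketch, reusing the base path and partition column from above; the variable names and bounds are only illustrative, and it assumes a single listStatus on the table directory is cheap:

import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: list the first-level partition directories on HDFS and keep
// those whose partition_date value falls inside the desired interval.
val tablePath = "/path/to/mydb.db/mytable"
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

val wantedPaths = fs.listStatus(new Path(tablePath))
  .map(_.getPath)
  .filter(_.getName.startsWith("partition_date="))
  .filter { p =>
    val value = p.getName.stripPrefix("partition_date=")
    // lexicographic comparison is safe for zero-padded yyyyMMddHHmm strings
    value >= "202001010000" && value <= "202001020000"
  }
  .map(_.toString)

val df = spark.read
  .option("basePath", tablePath)
  .orc(wantedPaths: _*)

Because only directories that still exist at listing time end up in wantedPaths, a partition removed by the retention job between listing and reading can in principle still cause a failure, but the window is far smaller than when thousands of stale paths are listed.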