After a recent upgrade to HDP 3.1 (now using Spark 2.3.x instead of 2.2.x), a query like:
spark.sql("SELECT * from mydb.mytable").filter('partition_date between "202001010000" and "202001020000").write.parquet("foo.out")
sometimes fails when reading from an HDFS-backed Hive table (no object storage). Note that the underlying data (an EXTERNAL table in Hive) has a retention period, and any data older than that cutoff is deleted. The deletion job runs every 5 minutes, so it can occur while the above query is executing.
Even though:
PartitionFilters: [isnotnull(partition_date#3099), (partition_date#3099 >= 202001010000), (partition_date#3099 <= 202001020000)]
partition filtering (predicate pushdown) appears to be enabled, more partitions than desired are read during the initial path traversal. Since the upgrade to 2.3, Spark shows the progress of listing file directories in the UI. Interestingly, we always get two entries: one for the lower of the two boundaries of interest, and one for the oldest directory still available:
Listing leaf files and directories for 380 paths:
/path/to/files/on/hdfs/mydb.db/mytable/partition_date=202001010000/sub_part=0, ...
Listing leaf files and directories for 7100 paths:
/path/to/files/on/hdfs/mydb.db/mytable/partition_date=201912301300/sub_part=0, ...
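For reference, the PartitionFilters quoted above come from the physical plan; here is a minimal sketch of how to reproduce that output, reusing the table, column, and bounds from the query (whether the scan shows up as a FileScan node with PartitionFilters depends on the table's storage format and the convertMetastore* settings):

// Sketch only: inspect the physical plan to confirm the pushed partition filters.
val df = spark.sql("SELECT * FROM mydb.mytable")
  .filter("partition_date BETWEEN '202001010000' AND '202001020000'")
df.explain(true)  // the FileScan node should show PartitionFilters: [...]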
Note the resulting error message:
File does not exist: /path/to/files/on/hdfs/mydb.db/mytable/partition_date=201912301300/sub_part=0/file_name_unique_hash_timestamp.par
How can I force Spark to list only the directories in the desired interval, and not ones outside it that may collide with the maximum data retention period?
It looks like this is related:
@meniluca is correct in the sense that there must be a mismatch between what HDFS actually has available and what the Hive metastore reports as available.
However, instead of using views, which seem a bit opaque/hard to reason about (in terms of which file paths end up included in the read operation), I prefer:
spark.read
  .option("basePath", "/path/to/mydb.db/mytable")
  .orc(
    "/path/to/mydb.db/mytable/partition_date=202001[1-2]*/*",
    "/path/to/mydb.db/mytable/partition_date=202001[3-4]*/*")
This forces Spark to list only the right (desired) paths.
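If hard-coding the globs is undesirable, the same idea can be expressed by building the path list explicitly before the read. A rough sketch, reusing the base path and partition column from above; the variable names and bounds are only illustrative, and it assumes a single listStatus on the table directory is cheap:

import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: list the first-level partition directories on HDFS and keep
// those whose partition_date value falls inside the desired interval.
val tablePath = "/path/to/mydb.db/mytable"
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

val wantedPaths = fs.listStatus(new Path(tablePath))
  .map(_.getPath)
  .filter(_.getName.startsWith("partition_date="))
  .filter { p =>
    val value = p.getName.stripPrefix("partition_date=")
    // lexicographic comparison is safe for zero-padded yyyyMMddHHmm strings
    value >= "202001010000" && value <= "202001020000"
  }
  .map(_.toString)

val df = spark.read
  .option("basePath", tablePath)
  .orc(wantedPaths: _*)

Because only directories that still exist at listing time end up in wantedPaths, a partition removed by the retention job between listing and reading can in principle still cause a failure, but the window is far smaller than when thousands of stale paths are listed.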