I've written a Spark DataFrame to partitioned ORC files like this:
df.repartition("LOADED")\
.write\
.partitionBy("LOADED")\
.format("orc")\
.save("location")
Everything is written to disk correctly. After that, I wanted to create a Hive table on top of it, like:
CREATE TABLE table USING ORC LOCATION 'location'
The command runs without any errors, but if I query the table, it comes back empty. I've tried the same thing without partitioning, and it works just fine. What am I doing wrong?
The partitioned folders look like: LOADED=2019-11-16
For reference: I want to write the data to Azure Blob Storage, and create a Hive table from it in a different cluster.
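Concretely, the save would target a wasbs:// URL; the container and account names below are just placeholders:

df.repartition("LOADED") \
    .write \
    .partitionBy("LOADED") \
    .format("orc") \
    .save("wasbs://mycontainer@myaccount.blob.core.windows.net/location")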
You just need to update the partition information on the table so Hive can list the partitions present. This is done with the MSCK REPAIR TABLE command:
spark.sql("MSCK REPAIR TABLE <tableName>")
More info on this command is in the Hive DDL documentation. Here is a quick example:
scala> spark.sql("select * from table").show
+------+------+
|column|LOADED|
+------+------+
+------+------+
scala> spark.sql("MSCK REPAIR TABLE table")
scala> spark.sql("select * from table").show
+------+----------+
|column| LOADED|
+------+----------+
| a|2019-11-16|
| c|2019-11-16|
| b|2019-11-17|
+------+----------+
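Since the question uses PySpark, here is the same flow as a minimal sketch in Python. The /tmp/orc_demo path and the demo table name are placeholders, and the sample rows just mirror the output above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.createDataFrame(
    [("a", "2019-11-16"), ("c", "2019-11-16"), ("b", "2019-11-17")],
    ["column", "LOADED"],
)

# Writes one subfolder per LOADED value, e.g. /tmp/orc_demo/LOADED=2019-11-16
df.repartition("LOADED") \
    .write \
    .partitionBy("LOADED") \
    .format("orc") \
    .save("/tmp/orc_demo")

# The table stays empty until the partition folders are registered
# in the metastore, which is what MSCK REPAIR TABLE does.
spark.sql("CREATE TABLE demo USING ORC LOCATION '/tmp/orc_demo'")
spark.sql("MSCK REPAIR TABLE demo")
spark.sql("SELECT * FROM demo").show()

MSCK REPAIR TABLE scans the table's location and adds any LOADED=... folders it finds to the metastore, which is why the query returns rows afterwards.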