Tags: apache-spark, pyspark, hive, partitioning, orc

Partitioned ORC table shows up empty in Hive


I've written a Spark DataFrame to partitioned ORC files like this:

df.repartition("LOADED")\
  .write\
  .partitionBy("LOADED")\
  .format("orc")\
  .save("location")

Everything lands on disk correctly. After that, I wanted to create a Hive table on top of it, like this:

CREATE TABLE table USING ORC LOCATION 'location'

The command runs without any errors, but if I try to query the table, it comes back empty.

I've tried to do the same without partitioning, and it works just fine. What am I doing wrong? The partition folders look like LOADED=2019-11-16.
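The full structure under location looks roughly like this (file names are illustrative):

location/
  LOADED=2019-11-16/
    part-00000-...snappy.orc
  LOADED=2019-11-17/
    part-00000-...snappy.orc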

For reference: I want to write the data to Azure Blob Storage, and create a Hive table from it in a different cluster.


Solution

  • You just need to update the partition info on the table so Hive can list the partitions present. This is done with the MSCK REPAIR TABLE command (an equivalent Spark catalog call is sketched after the example below):

    spark.sql("MSCK REPAIR TABLE <tableName>")
    

    More info on this command is in the Hive DDL documentation under MSCK REPAIR TABLE (recover partitions).

    A quick example:

    scala> spark.sql("select * from table").show
    20/03/28 17:12:46 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
    +------+------+
    |column|LOADED|
    +------+------+
    +------+------+
    
    scala> spark.sql("MSCK REPAIR TABLE table")
    
    
    scala> spark.sql("select * from table").show
    +------+----------+
    |column|    LOADED|
    +------+----------+
    |     a|2019-11-16|
    |     c|2019-11-16|
    |     b|2019-11-17|
    +------+----------+
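    If you'd rather stay in the DataFrame API, Spark's catalog exposes the same repair operation. A minimal sketch, using the table name from the question:

    // Equivalent to MSCK REPAIR TABLE: scans the table's LOCATION and
    // registers any partition directories (LOADED=...) that the
    // metastore doesn't know about yet.
    spark.catalog.recoverPartitions("table")

    Note that the repair only picks up partition directories that exist at the time it runs. Since you write the data to Azure Blob Storage and query it from a different cluster, run the repair on the querying cluster, and run it again (or MSCK REPAIR TABLE) whenever new partition folders are written to the location.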