
Impala/Hive: expose partitioned parquet files as a table


I have used df.to_parquet(path="...", partition_cols="period") to write a dataframe to parquet files on HDFS (Cloudera CDH). This works fine: a subfolder structure is created (period=YYYYMMDD), and when I read individual parquet files they contain the expected data (as expected, they do not contain the period column, since it was partitioned out).
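For reference, a minimal sketch of that write step, with made-up data and a placeholder HDFS path (the real path is elided above), assuming the pyarrow engine and an HDFS filesystem that pandas can reach:

import pandas as pd

# Made-up stand-in for the real dataframe; the 'period' values
# become the partition directory names (period=20200101, ...).
df = pd.DataFrame({
    "period": [20200101, 20200101, 20200102],
    "value": [1.0, 2.0, 3.0],
})

# partition_cols moves 'period' out of the files and into the
# directory layout, which is why the column is absent from the
# individual parquet files.
df.to_parquet(
    path="hdfs://nameservice1/data/db/tablename",  # placeholder path
    engine="pyarrow",
    partition_cols=["period"],
)

I am now trying to create a table on top of this layout using Impala/Hive: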

CREATE EXTERNAL TABLE lab.tablename
LIKE PARQUET '/data/.../db/tablename/period=20200101/dhfjksdhdsakjdhjk.parquet'
PARTITIONED BY (period INT)
STORED AS PARQUET
LOCATION '/data/.../db/tablename'

At first, it seems to work: the table shows up in the database, the columns are listed correctly, and period is even listed as a partition key. However:

select * from lab.tablename
>>  Done. 0 results.

Somehow Impala doesn't seem to register the partitions correctly. I also tried:

COMPUTE STATS lab.tablename
>> Updated 0 partition(s) and 2 column(s).

and I get 0 rows when checking:

SHOW TABLE STATS lab.tablename

If I write the same data in a non-partitioned fashion, it works fine.


Solution

  • Hive isn't aware of the partitions in the metastore, even though you used the correct folder structure. Use:

    MSCK REPAIR TABLE lab.tablename
    

    This should discover the missing partitions and add them to the metastore.

    Then, in Impala, run this command so it picks up the changes (a scripted sketch of both steps follows at the end of this answer):

    INVALIDATE METADATA lab.tablename
    

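    If you want to script both steps, here is a minimal sketch, assuming the pyhive and impyla packages and placeholder hostnames for your cluster. MSCK REPAIR TABLE is a Hive statement, so it goes to HiveServer2; INVALIDATE METADATA and the verification query go to an Impala daemon.

    from impala.dbapi import connect
    from pyhive import hive

    # MSCK REPAIR TABLE is Hive-only, so send it to HiveServer2
    # (hostnames and ports below are placeholders).
    hive_cur = hive.connect(host="hiveserver2.example.com", port=10000).cursor()
    hive_cur.execute("MSCK REPAIR TABLE lab.tablename")

    # Tell Impala to reload the table's metadata, then check that
    # the partitions are now visible.
    impala_cur = connect(host="impalad.example.com", port=21050).cursor()
    impala_cur.execute("INVALIDATE METADATA lab.tablename")
    impala_cur.execute("SHOW PARTITIONS lab.tablename")
    for row in impala_cur.fetchall():
        print(row)

    The same two statements can be run from any Hive and Impala client; the key point is that the repair happens in Hive while the metadata reload happens in Impala.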