I have used df.to_parquet(path="...", partition_cols="period") to write a dataframe to parquet files on HDFS (Cloudera CDH). This works fine: a subfolder structure is created (period=YYYYMMDD), and when I read individual parquet files they contain the expected data (as expected, they do not contain the period column, since it was partitioned out).
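For context, the write looks roughly like this (the dataframe and path are simplified placeholders; I pass partition_cols as a list here, which is the documented form, and assume the pyarrow engine):
import pandas as pd

# Toy data just for illustration; the real dataframe comes from elsewhere
df = pd.DataFrame({
    "period": [20200101, 20200101, 20200102],
    "value": [1.0, 2.0, 3.0],
})

# Creates one subfolder per period value, e.g. .../tablename/period=20200101/
df.to_parquet(
    path="hdfs:///data/.../db/tablename",  # placeholder, path elided as above
    engine="pyarrow",
    partition_cols=["period"],
)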
I am now trying to create a table on top of these files using Impala/Hive:
CREATE EXTERNAL TABLE lab.tablename
LIKE PARQUET '/data/.../db/tablename/period=20200101/dhfjksdhdsakjdhjk.parquet'
PARTITIONED BY (period INT)
STORED AS PARQUET
LOCATION '/data/.../db/tablename'
At first, it seems to work: the table shows up in the database, the columns are correctly listed, and period is even listed as the partition key. However:
select * from lab.tablename
>> Done. 0 results.
Somehow Impala doesn't seem to pick up the partitions correctly. I also tried:
COMPUTE STATS lab.tablename
>> Updated 0 partition(s) and 2 column(s).
and I get 0 rows when checking
SHOW TABLE STATS lab.tablename
If I write the same data in a non-partitioned fashion, it works fine.
Hive isn't aware of the partitions in the metastore, even though you used the correct folder structure. Run:
MSCK REPAIR TABLE lab.tablename
It should discover the missing partitions and add them to the metastore.
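That is roughly the same as registering each partition directory by hand, which MSCK REPAIR automates for you (shown here for a single partition, using your elided path):
ALTER TABLE lab.tablename ADD PARTITION (period=20200101)
LOCATION '/data/.../db/tablename/period=20200101'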
For Impala, run this command so it picks up the changes:
INVALIDATE METADATA lab.tablename
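After that, you should be able to verify that the partitions were registered, e.g.:
SHOW PARTITIONS lab.tablename
select * from lab.tablename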
Good Read if you want a more detailed explanation.