hadoop, cloudera-cdh, impala

Impala - Replace all data in a table's partition


I have a program that generates all the data for one partition of an Impala table. This program writes the data to an HDFS text file.

How can I (physically) remove all the data that previously belonged to the partition and replace it with the data from the new text file, converted to Parquet format?

If I physically remove the old Parquet files that make up the partition using the raw HDFS API, will that disturb Impala?


Solution

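  • Create an external table over your text files. The location must be the HDFS directory that holds the files, not a single file:

    create external table stg_table (...) location '<hdfs directory containing your text file>';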
    

    Whenever the files in that directory change, refresh the table so that Impala sees the new data:

    refresh stg_table;
    

    Then insert into your target table:

    insert overwrite table target_table select * from stg_table;
    

    If your target table is partitioned, name the partition to replace:

    insert overwrite table target_table partition(<partition spec>) select * from stg_table;
    

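    The 'overwrite' keyword does the trick: it replaces the existing data of the whole table, or of just the named partition, with the result of the query. If target_table is declared as a Parquet table, the inserted rows are written as Parquet files, so the text-to-Parquet conversion happens as part of the insert.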
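
Since the data is produced by a program, the refresh and overwrite steps above can also be generated programmatically. The following is only a minimal sketch in Python: the function name `partition_reload_statements`, the partition columns `year` and `region`, and the data columns `id` and `amount` are hypothetical, and it only builds the SQL text; running the statements is left to whatever Impala client you already use. It assumes a static partition spec, in which case the select list should contain only the non-partition columns.

```python
import re

# Plain or database-qualified identifier, e.g. "stg_table" or "db.stg_table".
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def _identifier(name):
    """Return name unchanged if it looks like a simple SQL identifier."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsupported identifier: {name!r}")
    return name


def _literal(value):
    """Render an int or a simple string as a SQL literal."""
    if isinstance(value, bool):
        raise TypeError("boolean partition values are not handled in this sketch")
    if isinstance(value, int):
        return str(value)
    if isinstance(value, str):
        # Refuse quotes and backslashes rather than guess at escaping rules.
        if "'" in value or "\\" in value:
            raise ValueError(f"unsupported characters in {value!r}")
        return f"'{value}'"
    raise TypeError(f"unsupported partition value type: {type(value).__name__}")


def partition_reload_statements(staging_table, target_table, partition, columns):
    """Build the statements that replace one partition of target_table.

    partition: dict mapping partition column name -> static value.
    columns:   non-partition columns to copy, in the target table's order.
    """
    if not partition:
        raise ValueError("partition must name at least one partition column")
    overlap = [c for c in columns if c in partition]
    if overlap:
        raise ValueError(f"partition columns must not be selected: {overlap}")

    spec = ", ".join(
        f"{_identifier(col)}={_literal(val)}" for col, val in partition.items()
    )
    select_list = ", ".join(_identifier(c) for c in columns)
    staging = _identifier(staging_table)
    target = _identifier(target_table)
    return [
        # Make Impala pick up the newly written text file(s).
        f"REFRESH {staging}",
        # Replace the partition's contents with the staged rows.
        f"INSERT OVERWRITE TABLE {target} PARTITION ({spec}) "
        f"SELECT {select_list} FROM {staging}",
    ]


if __name__ == "__main__":
    for statement in partition_reload_statements(
        "stg_table", "target_table", {"year": 2024, "region": "eu"}, ["id", "amount"]
    ):
        print(statement + ";")
```

Building the statements separately from executing them keeps the sketch independent of any particular driver, and the explicit column list avoids relying on `select *` matching the target table's non-partition columns.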