Search code examples
parquetcreate-tablepartitionhuekudu

Impala create parquet table with partition from existing Kudu table


I'm trying to create a parquet table with partition from existing Kudu table in Impala without having to re-specify the columns with data types again.

CREATE TABLE IF NOT EXISTS db_name.parquet_table
PARTITIONED BY (`year` SMALLINT, `month` TINYINT, `day` TINYINT)
STORED AS PARQUET 
As SELECT * FROM db_name.kudu_table limit 0

I get an error message saying

ParseException: Syntax error in line 4:undefined: As SELECT * FROM db_name.parquet_table limit 0 ^ Encountered: AS Expected CAUSED BY: Exception: Syntax error

But When I try to create the parquet table without partition it works using:

CREATE TABLE IF NOT EXISTS db_name.parquet_table STORED AS PARQUET  
AS SELECT * FROM db_name.kudu_table LIMIT 0 

But When I try to add partition using:

ALTER TABLE db_name.parquet_table ADD PARTITION(`year`=0,`month`=0,`day`=0)

I get the following error message:

AnalysisException: Table is not partitioned: db_name.parquet_table

Solution

  • You could try to load data with

    STATIC PARTITIONING

    One way to load data into a partitioned table is to use static partitioning, in which you manually define the different partitions. (The other way is to have the partitions automatically defined when you load the data, this is called Dynamic Partitioning.)

    With static partitioning, you create a partition manually, using an ALTER TABLE … ADD PARTITION statement, and then load the data into the partition.

    With your example I would try this.

    First. Create the partitioned table

    CREATE TABLE IF NOT EXISTS db_name.parquet_table(field1 [TYPE],...,fieldN [TYPE]) --Fields that not belongs to the partitioned columns
    PARTITIONED BY (year SMALLINT, month TINYINT, day TINYINT)
    STORED AS PARQUET;
    

    Second. ALTER TABLE statement to create the partition.

    ALTER TABLE db_name.parquet_table ADD PARTITION(year = 0,month = 0,day = 0);
    

    Notice how the partition column name and the specific value that defines this partition, are both specified in the ADD PARTITION clause. This creates a partition directory inside the table directory.

    After the partition is created, you can add data into the partition using an INSERT … SELECT statement

    INSERT OVERWRITE TABLE db_name.parquet_table
        PARTITION(year = 0, month = 0, day = 0)
        SELECT field1,..., fieldn --do not include partitioned columns
        FROM db_name.kudu_table 
        WHERE year = 0 AND month = 0 AND day = 0;
    

    With static partitioning, you need to repeat these two steps for each partition: first create the partition, then add data. You can actually use any method to load the data; you need not use an INSERT statement. You could instead use hdfs dfs commands or a LOAD DATA INPATH command. But however you load the data, you have to be careful to ensure that data is stored in the correct partition subdirectories.