Tags: hive, sqoop, impala

Export data from a Hive/Impala table with conditions into a file


What is an efficient way to export data from a Hive/Impala table, with conditions, into a file? The data is large, close to 10 GB. The Hive table is stored as Parquet with Snappy compression, and the output file should be CSV.

The table is partitioned daily and the data needs to be extracted on a daily basis. I would like to know which of the following approaches would be more efficient:

1) Impala approach

impala-shell -k -i servername:portname -B -q 'SELECT * FROM table WHERE year_month_date=$$$$$$$$' -o filename --output_delimiter='\001'

2) Hive approach

INSERT OVERWRITE DIRECTORY '/path' SELECT * FROM table WHERE year_month_date=$$$$$$$$;



Solution

  • Try using dynamic partitioning for your Hive/Impala table to export the data conditionally and efficiently.

    Partition the table on the columns you filter by most often; aligning the partition scheme with your queries gives the best results.

    Step 1: Create a temporary Hive table, TmpTable, and load your raw data into it, as sketched below.
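
    A minimal sketch of this step, assuming the raw data is comma-delimited text at a hypothetical HDFS path and uses the columns from the later steps:

    CREATE TABLE TmpTable (
     emp_id int,
     emp_name string,
     location string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- '/path/to/raw/data' is a placeholder; point it at your actual raw files
    LOAD DATA INPATH '/path/to/raw/data' INTO TABLE TmpTable;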

    Step 2: Set Hive parameters to enable dynamic partitioning:

    SET hive.exec.dynamic.partition.mode=nonstrict;
    SET hive.exec.dynamic.partition=true;
    

    Step 3: Create your main Hive table with partition columns, for example:

    CREATE TABLE employee (
     emp_id int,
     emp_name string
    )
    PARTITIONED BY (location string)
    STORED AS PARQUET;
    

    Step 4: Load data from the temporary table into your main employee table:

    -- the dynamic partition column (location) must come last in the SELECT list
    INSERT OVERWRITE TABLE employee PARTITION (location)
    SELECT emp_id, emp_name, location FROM TmpTable;
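
    You can then confirm that the dynamic partitions were created:

    SHOW PARTITIONS employee;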
    

    Step 5: Export the data from Hive with a condition:

    INSERT OVERWRITE DIRECTORY '/path/to/output/dir'
    SELECT * FROM employee WHERE location='CALIFORNIA';
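
    Note that INSERT OVERWRITE DIRECTORY writes Hive's default field delimiter (\001), while the question asks for CSV. On Hive 0.11 and later you can specify the delimiter in the export itself; a sketch, reusing the same hypothetical output path:

    INSERT OVERWRITE DIRECTORY '/path/to/output/dir'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    SELECT * FROM employee WHERE location='CALIFORNIA';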
    

    Please refer to this link:

    Dynamic Partition Concept

    Hope this is useful.