What is an efficient way to export data from a Hive/Impala table, with conditions, into a file? The data is huge, close to 10 GB. The Hive table is stored as Parquet with Snappy compression, and the output file should be CSV.
The table is partitioned daily and the data needs to be extracted on a daily basis. I am considering two approaches:
1) Impala approach
impala-shell -k -i servername:portname -B -q 'select * from table where year_month_date=$$$$$$$$' -o filename --output_delimiter='\001'
2) Hive approach
INSERT OVERWRITE DIRECTORY '/path' SELECT * FROM table WHERE year_month_date=$$$$$$$$
Which of these would be more efficient?
Try using dynamic partitioning for your Hive/Impala table to export the data efficiently with your conditions.
Partition the table on the columns you filter by, based on the queries you run, for best results.
Step 1: Create a temporary Hive table TmpTable and load your raw data into it, as in the sketch below.
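A minimal sketch, assuming the raw data is a comma-delimited file already on HDFS; the path /data/raw/employees and the columns are illustrative, not from your setup:

-- Staging table matching the raw CSV layout (columns are illustrative)
CREATE TABLE TmpTable (
  emp_id int,
  emp_name string,
  location string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Move the raw file from its HDFS location into the staging table
LOAD DATA INPATH '/data/raw/employees' INTO TABLE TmpTable;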
Step 2: Set Hive parameters to enable dynamic partitioning:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
Step 3: Create your main Hive table with partition columns, for example:
CREATE TABLE employee (
  emp_id int,
  emp_name string
)
PARTITIONED BY (location string)
STORED AS PARQUET;
Step 4: Load data from the temporary table into your employee table (main table):
INSERT OVERWRITE TABLE employee PARTITION (location)
SELECT emp_id, emp_name, location FROM TmpTable;
Step 5: Export the data from Hive with a condition:
INSERT OVERWRITE DIRECTORY '/path/to/output/dir' SELECT * FROM employee WHERE location='CALIFORNIA';
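Note that INSERT OVERWRITE DIRECTORY writes \001-delimited text files by default. Since you want CSV, on Hive 0.11 and later you can specify the delimiter explicitly; a sketch reusing the same output path:

-- Write comma-delimited output instead of the default \001 delimiter
INSERT OVERWRITE DIRECTORY '/path/to/output/dir'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM employee WHERE location='CALIFORNIA';

Because employee is partitioned on location, this query reads only the matching partition rather than scanning the whole table.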
Hope this is useful.