Tags: hadoop, hive, hiveql, hadoop-partitioning

How to constrain Hive query file output to always be a single file


I have created a Hive table using the first query below, and I insert data into this table on a daily basis using the second query:

CREATE EXTERNAL TABLE IF NOT EXISTS DB.efficacy
(
    product string,
    TP_Silent INT,
    TP_Active INT,
    server_date date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://hdfsadlproduction/user/DB/Report/efficacy';

INSERT INTO DB.efficacy
SELECT
    product,
    SUM(CASE WHEN verdict = 'TP_Silent' THEN 1 ELSE 0 END),
    SUM(CASE WHEN verdict = 'TP_Active' THEN 1 ELSE 0 END),
    current_date()
FROM
    DB.efficacy_raw
GROUP BY
    product
;

The issue is that every day, when my insert query executes, it creates a new file in the Hadoop FS. I want each day's query output to be appended to the same single file, but instead the Hadoop FS accumulates files like 000000_0, 000000_0_copy_1, 000000_0_copy_2.

I have used the following Hive settings:

SET hive.execution.engine=mr;
SET tez.queue.name=${queueName};
SET mapreduce.job.queuename=${queueName};
SET mapreduce.map.memory.mb  = 8192;
SET mapreduce.reduce.memory.mb = 8192;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.parallel = true;
SET hive.exec.parallel.thread.number = 2;
SET mapreduce.input.fileinputformat.split.maxsize=2048000000;
SET mapreduce.input.fileinputformat.split.minsize=2048000000;
SET mapreduce.job.reduces = 20;
SET hadoop.security.credential.provider.path=jceks://hdfs/user/efficacy/s3-access/efficacy.jceks;
set hive.vectorized.execution.enabled=false;
set hive.enforce.bucketmapjoin=false;
set hive.optimize.bucketmapjoin.sortedmerge=false;
set hive.enforce.sortmergebucketmapjoin=false;
set hive.optimize.bucketmapjoin=false;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.compress.intermediate=false;
set hive.exec.compress.output=false;
set hive.exec.reducers.max=1;

I am a beginner in the Hive and Hadoop world, so please excuse me. Any help will be greatly appreciated.

Note: I am using Hadoop 2.7.3.2.5.0.55-1.


Solution

  • I didn't find any direct mechanism or Hive setting that automatically merges all the small files at the end of the query. Concatenation of small files is currently not supported for tables stored as TEXTFILE.

    As per the comment by "leftjoin" on my post, I created the table in ORC format and then used Hive's CONCATENATE command to merge all the small files into one big file (a minimal sketch of this flow is shown at the end of this answer).

    I then used the Hive query below to export the data from this single big ORC file into a single text file, and was able to complete my task with the exported text file.

    INSERT OVERWRITE DIRECTORY '<Hdfs-Directory-Path>'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    SELECT * FROM default.foo;
    

    Courtesy: https://community.hortonworks.com/questions/144122/convert-orc-table-data-into-csv.html
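
    For reference, here is a minimal sketch of the ORC + CONCATENATE flow described above, assuming a non-partitioned table. The table name DB.efficacy_orc is hypothetical; the columns and the aggregation mirror the original queries in the question.

    -- Hypothetical ORC staging table mirroring the original DDL
    CREATE TABLE IF NOT EXISTS DB.efficacy_orc
    (
        product string,
        TP_Silent INT,
        TP_Active INT,
        server_date date
    )
    STORED AS ORC;

    -- Daily insert, same aggregation as the original query
    INSERT INTO DB.efficacy_orc
    SELECT
        product,
        SUM(CASE WHEN verdict = 'TP_Silent' THEN 1 ELSE 0 END),
        SUM(CASE WHEN verdict = 'TP_Active' THEN 1 ELSE 0 END),
        current_date()
    FROM
        DB.efficacy_raw
    GROUP BY
        product;

    -- Merge the table's small ORC files into fewer large files
    ALTER TABLE DB.efficacy_orc CONCATENATE;

    Note that CONCATENATE works on ORC (and RCFile) tables only; for a partitioned table you would concatenate a specific partition instead, via ALTER TABLE ... PARTITION (...) CONCATENATE.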