hadoop mapreduce hive cloudera hadoop-partitioning

How can I reduce the number of mappers when running a Hive query?

I am using Hive.

I have 24 JSON files with a total size of 300 MB (in one folder), so I created an external table (i.e. table1) and loaded the data (i.e. the 24 files) into it.

When I run a select query on that external table (i.e. table1), I observe 3 mappers and 1 reducer running.

After that, I created one more external table (i.e. table2).

I then compressed my input files (the folder containing the 24 files), for example with BZIP2.

The compression produced 24 files with the extension “.BZiP2” (i.e. file1.bzp2, ….., file24.bzp2).

After that, I loaded the compressed files into the second external table.

Now, when I run a select query, it uses 24 mappers and 1 reducer. I also observed that the CPU time is higher than with the uncompressed data (i.e. files).

How can I reduce the number of mappers if the data is in compressed format (i.e. the select query on table2)?

How can I reduce the CPU time if the data is in compressed format (i.e. the select query on table2)? How does CPU time affect performance?

Solution

You got 24 mappers because you have 24 files. The number of mappers can be less than the number of files only if the files are on the same datanode. If the files are located on different datanodes, the number of mappers will never be less than the number of files.

To reduce the number of mappers, concatenate all or some of the files and put the result into your table location. Use the cat command for concatenating non-compressed files.

The parameters mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize are for splitting bigger files.
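The concatenation step could look roughly like the sketch below. It assumes the input files are uncompressed, line-delimited JSON with a `.json` extension; the function name `merge_files` and all paths are hypothetical and only for illustration.

```shell
# merge_files SRC_DIR OUT_FILE
# Concatenates every *.json file in SRC_DIR into OUT_FILE.
# Intended for uncompressed input only. OUT_FILE should live outside SRC_DIR
# so that it is not picked up by the glob.
merge_files() {
    src_dir=$1
    out_file=$2
    : > "$out_file"                      # start with an empty output file
    for f in "$src_dir"/*.json; do
        [ -f "$f" ] || continue          # skip if the glob matched nothing
        cat "$f" >> "$out_file"
        # If a file does not end with a newline, add one so that its last
        # record is not glued to the first record of the next file.
        if [ -n "$(tail -c 1 "$f")" ]; then
            echo >> "$out_file"
        fi
    done
}

# Example usage (hypothetical paths):
#   merge_files ./json_input ./merged.json
#   hdfs dfs -put -f ./merged.json /path/to/table1/
```

After the merged file replaces the 24 small files in the table location, the query has fewer input files to assign to mappers.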