java hadoop mapreduce hadoop-partitioning

Hadoop MapReduce - How to create dynamic partition

How to create dynamic partition using java map reduce, like sql we have group by country column. Example i have country based dataset and need to separate the records based on country ( partition). We can't limit the coutry. since every day will get new country data.

Solution

You can leverage the dynamic partitioning feature of Hive to automatically populate partitions based on incoming data. Below example demonstrates auto-partitioning of raw data based upon country information.

Create a raw data file (country1.csv), which has data for multiple countries

1,USA
2,Canada
3,USA
4,Brazil
5,Brazil
6,USA
7,Canada

Upload this file to a location in HDFS

hadoop fs -mkdir /example_hive
hadoop fs -mkdir /example_hive/country
hadoop fs -put country1.csv /example_hive/country

Create a non-partitioned Hive table on top of the data

CREATE EXTERNAL TABLE country
(
id int, 
country string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 'hdfs:///example_hive/country';

Verify that the Hive table is created correctly

hive (default)> select * from country;
1   USA
2   Canada
3   USA
4   Brazil
5   Brazil
6   USA
7   Canada

Create a partitioned Hive table, with country as the partition

hive (default)> CREATE TABLE country_par
(
id int
)
PARTITIONED BY (country string);

Enable dynamic partitioning

hive (default)> SET hive.exec.dynamic.partition = true;
hive (default)> SET hive.exec.dynamic.partition.mode = nonstrict;

Populate the partitioned table, with Hive automatically putting the data in the right country partition

hive (default)> INSERT INTO TABLE country_par 
PARTITION(country)
SELECT id,country FROM country;

Verify that the partitions were created, and populated correctly

hive (default)> show partitions country_par;
country=Brazil
country=Canada
country=USA

hive (default)> select * from country_par where country='Brazil';
4   Brazil
5   Brazil

hive (default)> select * from country_par where country='USA';
1   USA
3   USA
6   USA

hive (default)> select * from country_par where country='Canada';
2   Canada
7   Canada

hive (default)> select country,count(*) from country_par group by country;
Brazil  2
Canada  2
USA 3