Search code examples
javahadoopmapreducehadoop-partitioning

Hadoop MapReduce - How to create dynamic partition


How to create dynamic partition using java map reduce, like sql we have group by country column. Example i have country based dataset and need to separate the records based on country ( partition). We can't limit the coutry. since every day will get new country data.


Solution

  • You can leverage the dynamic partitioning feature of Hive to automatically populate partitions based on incoming data. Below example demonstrates auto-partitioning of raw data based upon country information.

    Create a raw data file (country1.csv), which has data for multiple countries

    1,USA
    2,Canada
    3,USA
    4,Brazil
    5,Brazil
    6,USA
    7,Canada
    

    Upload this file to a location in HDFS

    hadoop fs -mkdir /example_hive
    hadoop fs -mkdir /example_hive/country
    hadoop fs -put country1.csv /example_hive/country
    

    Create a non-partitioned Hive table on top of the data

    CREATE EXTERNAL TABLE country
    (
    id int, 
    country string
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    LOCATION 'hdfs:///example_hive/country';
    

    Verify that the Hive table is created correctly

    hive (default)> select * from country;
    1   USA
    2   Canada
    3   USA
    4   Brazil
    5   Brazil
    6   USA
    7   Canada
    

    Create a partitioned Hive table, with country as the partition

    hive (default)> CREATE TABLE country_par
    (
    id int
    )
    PARTITIONED BY (country string);
    

    Enable dynamic partitioning

    hive (default)> SET hive.exec.dynamic.partition = true;
    hive (default)> SET hive.exec.dynamic.partition.mode = nonstrict;
    

    Populate the partitioned table, with Hive automatically putting the data in the right country partition

    hive (default)> INSERT INTO TABLE country_par 
    PARTITION(country)
    SELECT id,country FROM country;
    

    Verify that the partitions were created, and populated correctly

    hive (default)> show partitions country_par;
    country=Brazil
    country=Canada
    country=USA
    
    hive (default)> select * from country_par where country='Brazil';
    4   Brazil
    5   Brazil
    
    hive (default)> select * from country_par where country='USA';
    1   USA
    3   USA
    6   USA
    
    hive (default)> select * from country_par where country='Canada';
    2   Canada
    7   Canada
    
    hive (default)> select country,count(*) from country_par group by country;
    Brazil  2
    Canada  2
    USA 3