Search code examples
hivehiveqlhadoop2hadoop-partitioning

hive script failing due to heap space issue to process too many partitions


my script failing due to a heap space issue to process too many partitions. To avoid the issue I am trying to insert all the partitions into a single partition but I am facing the below error

FAILED: SemanticException [Error 10044]: Line 1:23 Cannot insert into target table because column number/types are different ''2021-01-16'': Table insclause-0 has 78 columns, but query has 79 columns.

    set hive.exec.dynamic.partition=true;
    set mapreduce.reduce.memory.mb=6144;
    set mapreduce.reduce.java.opts=-Xmx5g;
    set hive.exec.dynamic.partition=true;
    insert overwrite table db_temp.travel_history_denorm partition (start_date='2021-01-16')
    select * from db_temp.travel_history_denorm_temp_bq
    distribute by start_date;```


Can someone please suggest what is the issue, I checked the schema for the tables it is the same. ?

Solution

  • You are inserting into static partition (partition value specified in the target table partition clause), in this case you should not have partition column in the select. And select * returns partition column (the last one), this is why query fails, it should be no partition column:

    Static partition insert:

    insert overwrite table db_temp.travel_history_denorm partition (start_date='2021-01-16')
       select col1, col2, col3 ... --All columns except start_date partition column
         from ...
    

    Dynamic partition:

     insert overwrite table db_temp.travel_history_denorm partition (start_date)
           select * --All columns in the same order, including partition
             from ...
    

    Adding distribute by triggers additional reduce step, all records are being grouped according to distribute by and each reducer receives single partition. This can help to solve the OOM problem when you are loading many dynamic partitions in each reducer. Without distribute by each reducer will create files in each partitions, keeping too many buffers simultaneously.

    In addition to distribute by you can set the maximum bytes per reducer. This setting will limit the amount of data processed by single reducer and also may help with OOM:

     set hive.exec.reducers.bytes.per.reducer=16777216; --adjust for optimal performance
    

    If this figure is too small, it will trigger too many reducers, if too big - then each reducer will process too much data. Adjust accordingly.

    Also try this setting for dynamic partition load:

    set hive.optimize.sort.dynamic.partition=true;
    

    When enabled, dynamic partitioning column will be globally sorted. This way we can keep only one record writer open for each partition value in the reducer thereby reducing the memory pressure on reducers.

    You can combine all these methods: distribute by partition key, bytes.per.reducer and sort.dynamic.partition for dynamic partition loading.

    Also exception message can help to understand where exactly the OOM happens and fix accordingly.