Search code examples
sqlhiveout-of-memoryquery-optimizationhiveql

Hive - How to efficiently Create Table As Select?


I have a hive table, htable that's partitioned on foo and bar. I want to create a small subset of this table for experiments, so I would think the thing to do would be

create table new_table like htable;

insert into new_table partition (foo, bar) select * from htable
where rand() < 0.01 and foo in (a,b)

This takes forever however and finally fails with a java.lang.OutOfMemoryError: Java heap space. Is there a better way?


Solution

  • Add distribute by foo, bar:

        insert into new_table partition (foo, bar) select * from htable
         where rand() < 0.01 and foo in (a,b) 
        distribute by foo, bar
    

    this will reduce memory consumption.