I have a hive table, htable
that's partitioned on foo
and bar
. I want to create a small subset of this table for experiments, so I would think the thing to do would be
create table new_table like htable;
insert into new_table partition (foo, bar) select * from htable
where rand() < 0.01 and foo in (a,b)
This takes forever however and finally fails with a java.lang.OutOfMemoryError: Java heap space
. Is there a better way?
Add distribute by foo, bar
:
insert into new_table partition (foo, bar) select * from htable
where rand() < 0.01 and foo in (a,b)
distribute by foo, bar
this will reduce memory consumption.