
Repartition in Hadoop


My question is mostly theoretical, but I have some tables that already follow a partition scheme. Say my table is partitioned by day, but after working with the data for some time we want to switch to monthly partitions instead. I could easily recreate the table with the new partition definition and reinsert the data, but is this the best approach? It sounds slow when the data is huge. I have seen that Hive has multiple ALTER commands for partitions; is there one that can help me achieve what I need?

Maybe another option is to concatenate the files and then recreate the table with the new partitioning?

ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;

Any relevant references would be appreciated as well.


Solution

  • If the files are in daily folders, you cannot mount many daily folders into a single month partition; for each month, the files need to be moved into a month folder. This cannot be done as a metadata-only operation.

    If you are comfortable with shell scripting, you can loop over the output of hadoop fs -ls <table location> | sort. In the loop, save each path into a variable and check whether its yyyy-MM substring differs from the previous one; if so, create the yyyy-MM folder. Then, for each row, copy everything into the month location (hadoop fs -cp daily_location/* month_location/). All of this can be done in a single loop.
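
    A minimal sketch of that loop, assuming the table location and a dt=yyyy-MM-dd folder naming convention (both are assumptions; adjust to your layout):

    ```shell
    # Walk daily folders like .../dt=2023-04-17 and copy each
    # into a monthly folder .../month=2023-04 (names are hypothetical).
    table_loc=/warehouse/mydb.db/mytable
    prev_month=""
    hadoop fs -ls "$table_loc" | awk '/dt=/ {print $NF}' | sort | while read -r day_dir; do
      month=${day_dir##*dt=}     # strip everything up to "dt=" -> 2023-04-17
      month=${month%-*}          # drop the day part            -> 2023-04
      if [ "$month" != "$prev_month" ]; then
        hadoop fs -mkdir -p "$table_loc/month=$month"
        prev_month=$month
      fi
      hadoop fs -cp "$day_dir"/* "$table_loc/month=$month/"
    done
    ```

    The month is derived with plain parameter expansion, so the script needs no external date tooling; the comparison with the previous month only avoids redundant mkdir calls, since hadoop fs -mkdir -p is idempotent anyway.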

    If you are on S3 and using AWS CLI commands, creating the folders is not necessary; just copy.

    If there are too many small files, you may want to concatenate them in the monthly folders. If the table is ORC, you can execute ALTER TABLE ... PARTITION ... CONCATENATE. If it is not ORC, it is better to use Hive INSERT OVERWRITE, which will do all of that for you: you can configure the merge task so that the resulting files end up at an optimal size. Additionally, if you add distribute by partition_col sort by <keys used in filters/joins>, you can improve compression efficiency and make bloom filters and internal indexes (for ORC/Parquet) genuinely useful; this will greatly reduce the table size and improve query performance.

    So, it is better to use Hive for this task, because it gives you the opportunity to improve the data storage along the way: change the storage format, concatenate files, and sort the data to reduce compressed size and make indexes and bloom filters really useful.
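
    As a sketch of that Hive-based approach (table and column names are assumptions), a monthly-partitioned table can be loaded in one pass with a dynamic-partition insert:

    ```sql
    -- Hypothetical names: mytable_daily (partitioned by dt=yyyy-MM-dd),
    -- mytable_monthly, columns id/payload. Adjust to your schema.
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    SET hive.merge.mapfiles=true;        -- merge small output files

    CREATE TABLE mytable_monthly (id BIGINT, payload STRING)
    PARTITIONED BY (mnth STRING)
    STORED AS ORC;

    INSERT OVERWRITE TABLE mytable_monthly PARTITION (mnth)
    SELECT id, payload, substr(dt, 1, 7) AS mnth   -- 2023-04-17 -> 2023-04
    FROM mytable_daily
    DISTRIBUTE BY substr(dt, 1, 7)       -- group each month's rows together
    SORT BY id;                          -- sort within files to aid ORC compression/indexes
    ```

    The DISTRIBUTE BY / SORT BY clauses implement the ordering suggestion above; replace id with the keys you actually filter or join on.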