hadoop, hdfs, cluster-computing

Dropping an HDFS partition key


I've realised I have a huge amount of data partitioned into too many small files on HDFS. The reason is that I saved the data using too many partitioning keys. Therefore, I need to merge the data under the last partitioning key in HDFS.

Fortunately, the partitioning key I want to delete is exactly the last one (I don't know if that makes it easier). I haven't come across a solution that doesn't involve a script that would take too long to do the job.

Here is an example of the HDFS I have:

/part1={lot_of_values}/part2={lot_of_values}/part_to_delete={lot_of_values}/{lot_of_files}.parquet

But I want to achieve:

/part1={lot_of_values}/part2={lot_of_values}/{lot_of_files}.parquet

That way I would have bigger files that load more quickly.


Solution

  • Fortunately, the partitioning key I want to delete is exactly the last one (I don't know if that makes it easier). I haven't come across a solution that doesn't involve a script that would take too long to do the job.

    1. Yes, it makes it pretty easy: you just need to move the files from each leaf directory into its parent directory (and delete the now-empty directories). This is not a big-data job, only file-system operations. Unless we're talking about many thousands of partitions, it should not take long. If there's a Hive catalog involved, you will have to update it as well.
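
       As a sketch of step 1, here is the same move-and-delete logic on a local file system (on HDFS proper you would issue the equivalent `hdfs dfs -mv` and `hdfs dfs -rmdir` per directory, or use the Java `FileSystem` API). The leaf-directory name `part_to_delete=` and the collision-avoiding rename are assumptions for illustration:

       ```python
       import os
       import shutil

       def drop_last_partition(root: str) -> None:
           """Move files from each part_to_delete=* leaf directory into its
           parent directory, then remove the emptied leaf directory.
           Pure file-system operations, no big-data job required."""
           # topdown=False so leaves are visited before their parents
           for dirpath, _dirnames, filenames in os.walk(root, topdown=False):
               leaf = os.path.basename(dirpath)
               if not leaf.startswith("part_to_delete="):
                   continue
               parent = os.path.dirname(dirpath)
               for name in filenames:
                   # Prefix with the leaf dir name so files coming from
                   # different partition values cannot collide in the parent.
                   shutil.move(os.path.join(dirpath, name),
                               os.path.join(parent, f"{leaf}-{name}"))
               os.rmdir(dirpath)  # leaf is empty now
       ```

       Note this only relocates the files; the many small Parquet files still exist and still need the merge described in step 2.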
    2. Yes again: you will have to run some Hadoop jobs to merge the Parquet files. The time it takes depends entirely on your data and resources, but the jobs themselves are pretty simple and straightforward.
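
       One common way to run such a merge job is Spark (a swapped-in choice here, not something the answer prescribes). The sketch below reads every small file, drops the unwanted partition column, and rewrites the data partitioned only by the two keys you keep, which handles steps 1 and 2 in one pass. It requires a Spark cluster, and the paths `/data` and `/data_merged` are placeholder assumptions; writing to a new location avoids overwriting the source while it is being read:

       ```python
       # Sketch only: assumes PySpark is installed and HDFS is reachable.
       from pyspark.sql import SparkSession

       spark = SparkSession.builder.appName("merge-small-files").getOrCreate()

       # basePath makes Spark recover part1/part2/part_to_delete as columns
       # even though the glob points below them.
       df = (spark.read
                  .option("basePath", "/data")
                  .parquet("/data/part1=*/part2=*/part_to_delete=*"))

       (df.drop("part_to_delete")              # the key being removed
          .repartition("part1", "part2")       # gather each partition's rows
          .write
          .partitionBy("part1", "part2")       # keep only the two keys
          .parquet("/data_merged"))            # fewer, larger files per dir
       ```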