hadoop hive hdfs hql hadoop-partitioning

How to delete the most recently created files in multiple HDFS directories?


I made a mistake and have added a few hundred part files to a table partitioned by date. I am able to see which files are new (these are the ones I want to remove). Most cases I've seen on here relate to deleting files older than a certain date, but I only want to remove my most recent files.

For a single day I may have three files like the ones below, and I only want to remove the new file. I can tell it's new from the modification timestamp shown by hadoop fs -ls.

/this/is/my_directory/event_date1_newfile_20191114
/this/is/my_directory/event_date1_oldfile_20190801
/this/is/my_directory/event_date1_oldfile_20190801

I have many dates, so I'll have to repeat this for event_date2, event_date3, and so on, always removing the 'newfile_20191114' file from each date.

The older dates are from August 2019, and my newfiles were updated yesterday, on 11/14/19.

I feel like there should be an easy/quick solution to this, but I'm having trouble finding the reverse case from what most folks have asked about.


Solution

  • As mentioned in your question, you can already tell which files need to be deleted. Create a simple script and redirect the listing to a temporary file,

    like this:

    hdfs dfs -ls /tmp | sort -k6,7 > files.txt
    

    Note that sort -k6,7 sorts the listing by date and time in ascending order, so the most recently modified files come last. I am sure you don't want to delete everything, so select only the last n entries (the newest ones), let's say 100.

    Then you can update your command to

    hdfs dfs -ls /tmp | sort -k6,7 | tail -100 |  awk '{print $8}' > files.txt
    

    Or, if you know the exact timestamp of your new files, you can filter on it with the command below

    hdfs dfs -ls /tmp | sort -k6,7 | grep "<exact_time_stamp>" |  awk '{print $8}' > files.txt
    

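    Then read that file and delete the files one by one

    while IFS= read -r file; do
      # "hdfs dfs -rm" is the file system shell command; plain "hdfs -rm" is not valid.
      # Log the path only if the delete succeeded, to track which files have been deleted.
      hdfs dfs -rm "$file" && echo "Deleted $file" >> deleted_files.txt
    done < files.txt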
    

    So your complete script could look like this

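    #!/bin/bash

    # List the directory, keep the entries matching the timestamp, and save their paths (field 8)
    hdfs dfs -ls /tmp | sort -k6,7 | grep "<exact_time_stamp>" | awk '{print $8}' > files.txt

    # Delete each listed file and record the ones that were actually removed
    while IFS= read -r file; do
        hdfs dfs -rm "$file" && echo "Deleted $file" >> deleted_files.txt
    done < files.txt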
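
Since the question is about many date directories at once, the same idea can be sketched as two small functions: one that picks file paths by modification date out of a listing, and one that deletes them with a dry-run default. This is only a sketch under some assumptions: the function names select_files_by_date and delete_listed and the DRY_RUN variable are made up for illustration, the sample listing uses invented owners, sizes and times, and paths are assumed to contain no spaces (awk takes the path from field 8).

```shell
#!/bin/bash

# Print the paths of plain files whose modification date (field 6 of
# `hdfs dfs -ls` output) equals the date given as the first argument.
# Reads listing lines on stdin; directories and the "Found N items"
# header line are skipped.
select_files_by_date() {
  awk -v day="$1" 'NF >= 8 && $1 !~ /^d/ && $6 == day { print $8 }'
}

# Delete every path read from stdin. With DRY_RUN=1 (the default in this
# sketch) the paths are only printed, so the list can be reviewed first.
delete_listed() {
  while IFS= read -r path; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would delete: $path"
    else
      # </dev/null keeps the hdfs client from consuming the loop's input
      hdfs dfs -rm "$path" </dev/null && echo "Deleted $path" >> deleted_files.txt
    fi
  done
}

# Example with a captured listing (hypothetical owner, sizes and times)
listing='Found 3 items
-rw-r--r--   3 user group  1024 2019-11-14 10:02 /this/is/my_directory/event_date1_newfile_20191114
-rw-r--r--   3 user group  2048 2019-08-01 09:15 /this/is/my_directory/event_date1_oldfile_20190801
drwxr-xr-x   - user group     0 2019-11-14 10:02 /this/is/my_directory/subdir'

printf '%s\n' "$listing" | select_files_by_date 2019-11-14 | delete_listed
# prints: would delete: /this/is/my_directory/event_date1_newfile_20191114
```

Against a real cluster, the listing would come from a recursive ls over the table directory, for example hdfs dfs -ls -R /this/is/my_directory | select_files_by_date 2019-11-14 | delete_listed, and only after checking the dry-run output would you rerun it with DRY_RUN=0.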