hadoop hive hdfs hql hadoop-partitioning

How to delete the most recently created files in multiple HDFS directories?

I made a mistake and have added a few hundred part files to a table partitioned by date. I am able to see which files are new (these are the ones I want to remove). Most cases I've seen on here relate to deleting files older than a certain date, but I only want to remove my most recent files.

For a single day, I may have 3 files as such, and I want to only remove the newfile. I can tell it's new because of the update timestamp when I use hadoop fs -ls

/this/is/my_directory/event_date1_newfile_20191114
/this/is/my_directory/event_date1_oldfile_20190801
/this/is/my_directory/event_date1_oldfile_20190801

I have many dates, so I'll have to complete this for event_date2, event_date3, etc etc, always removing the 'new_file_20191114' from each date.

The older dates are from August 2019, and my newfiles were updated yesterday, on 11/14/19.

I feel like there should be an easy/quick solution to this, but I'm having trouble finding the reverse case from what most folks have asked about.

Solution

AS mentioned in your answer you have got the list of files that needs to be deleted. Create a simple script redirect the output to temp file

like this

hdfs dfs -ls /tmp | sort -k6,7 > files.txt

Please note sort -k6,7 this will give all the files but in sorted order of timestamp. I am sure you dont want to delete all thus you can select the top n files that needs to be deleted lets say 100

then you can update your command to

hdfs dfs -ls /tmp | sort -k6,7 | head -100 |  awk '{print $8}' > files.txt

or if you know specific timestamp of your new files then you can try below command

hdfs dfs -ls /tmp | sort -k6,7 | grep "<exact_time_stamp>" |  awk '{print $8}' > files.txt

Then read that file and delete all files one by one

while read file; do
  hdfs -rm $file
  echo "Deleted $file" >> deleted_files.txt #this is to track which files have been deleted

done <files.txt

So you complete script can be like

#!/bin/bash

 hdfs dfs -ls /tmp | sort -k6,7 | grep "<exact_time_stamp>" |  awk '{print $8}' > files.txt

 while read file; do
     hdfs -rm $file
     echo "Deleted $file" >> deleted_files.txt #this is to track which files have been deleted

   done <files.txt