I made a mistake and have added a few hundred part files to a table partitioned by date. I am able to see which files are new (these are the ones I want to remove). Most cases I've seen on here relate to deleting files older than a certain date, but I only want to remove my most recent files.
For a single day, I may have 3 files like the ones below, and I want to remove only the newfile. I can tell it's new from the update timestamp shown by hadoop fs -ls.
/this/is/my_directory/event_date1_newfile_20191114
/this/is/my_directory/event_date1_oldfile_20190801
/this/is/my_directory/event_date1_oldfile_20190801
I have many dates, so I'll have to complete this for event_date2, event_date3, etc etc, always removing the 'new_file_20191114' from each date.
The older dates are from August 2019, and my newfiles were updated yesterday, on 11/14/19.
I feel like there should be an easy/quick solution to this, but I'm having trouble finding the reverse case from what most folks have asked about.
As mentioned in your question, you already have a way to identify the files that need to be deleted. Create a simple script and redirect the listing to a temp file, like this:
hdfs dfs -ls /tmp | sort -k6,7 > files.txt
Please note that sort -k6,7 lists all the files, sorted by timestamp in ascending order, so the newest files come last. I am sure you don't want to delete everything, so select only the newest n files that need to be deleted, let's say 100. Use tail for this, because head would return the oldest entries (and the "Found N items" header line).
Then you can update your command to
hdfs dfs -ls /tmp | sort -k6,7 | tail -100 | awk '{print $8}' > files.txt
Or, if you know the specific timestamp of your new files, you can try the command below:
hdfs dfs -ls /tmp | sort -k6,7 | grep "<exact_time_stamp>" | awk '{print $8}' > files.txt
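One caveat with the grep variant is that grep matches the string anywhere on the line, not only in the timestamp columns. As a minimal sketch, assuming the usual 8-column hdfs dfs -ls output with the date printed as YYYY-MM-DD in field 6 and no spaces in the paths, you could compare the date column directly. The function name paths_modified_on is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read `hdfs dfs -ls` output on stdin and print the path
# (field 8) of every file whose modification date (field 6) equals $1.
# Assumes the usual 8-column listing, YYYY-MM-DD dates, and no spaces in paths.
paths_modified_on() {
  # NF >= 8 skips the "Found N items" header; $1 !~ /^d/ skips directories.
  awk -v d="$1" 'NF >= 8 && $6 == d && $1 !~ /^d/ { print $8 }'
}

# Usage (not run here):
#   hdfs dfs -ls /tmp | paths_modified_on 2019-11-14 > files.txt
```

Check files.txt by eye before deleting anything.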
Then read that file and delete the files one by one:
while read -r file; do
  # log a file only if the delete actually succeeded
  hdfs dfs -rm "$file" && echo "Deleted $file" >> deleted_files.txt
done < files.txt
So your complete script can look like this:
#!/bin/bash
# collect the paths of the files whose listing matches the timestamp
hdfs dfs -ls /tmp | sort -k6,7 | grep "<exact_time_stamp>" | awk '{print $8}' > files.txt
while read -r file; do
  # log a file only if the delete actually succeeded
  hdfs dfs -rm "$file" && echo "Deleted $file" >> deleted_files.txt
done < files.txt
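Since a delete is hard to undo, it may be worth previewing what the loop would do first. Here is a rough sketch of the same loop with a dry-run switch. The function name delete_listed and the DRY_RUN variable are my own assumptions, not part of any Hadoop tooling.

```shell
#!/bin/bash
# Hypothetical helper: delete every path listed (one per line) in the file $1.
# With DRY_RUN=1 (the default here) it only prints the commands it would run.
delete_listed() {
  local list="$1" file
  while IFS= read -r file; do
    [ -z "$file" ] && continue                      # skip blank lines
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "hdfs dfs -rm $file"                     # preview only
    elif hdfs dfs -rm "$file" < /dev/null; then     # keep the loop's stdin untouched
      echo "Deleted $file" >> deleted_files.txt     # log only successful deletes
    fi
  done < "$list"
}

# Usage (not run here):
#   delete_listed files.txt              # preview the commands
#   DRY_RUN=0 delete_listed files.txt    # actually delete
```

Run it once in preview mode, compare the printed paths with what you expect, and only then rerun with DRY_RUN=0.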