python python-2.7 file-management data-management

Find and remove duplicate files using Python

I have several folders which contain duplicate files that have slightly different names (e.g. file_abc.jpg, file_abc(1).jpg), i.e. a suffix of "(1)" on the end. I am trying to develop a relatively simple method to search through a folder, identify duplicates, and then delete them. The criterion for a duplicate is "(1)" at the end of the file name, so long as the original also exists.

I can identify duplicates okay; however, I am having trouble creating the text string in the right format to delete them. It needs to be "C:\Data\temp\file_abc(1).jpg", but using the code below I end up with r"C:\Data\temp''file_abc(1).jpg".

I have looked at the answers to "Finding duplicate files and removing them"; however, this seems to be far more sophisticated than what I need.

If there are better (and simpler) ways to do this then please let me know; however, I only have around 10,000 files in total in 50-odd folders, so not a great deal of data to crunch through.

My code so far is:

import os

file_path = r"C:\Data\temp"
file_list = os.listdir(file_path)
print (file_list)

for file in file_list:
    if ("(1)" in file):
        index_no = file_list.index(file)
        print("!! Duplicate file, number in list: "+str(file_list.index(file)))
        file_remove = ('r"%s' %file_path+"'\'"+file+'"')
        print ("The text string is: " + file_remove)
        os.remove(file_remove)

Solution

Your code is just a little more complex than necessary, and you didn't use a proper way to build a file path from a directory path and a file name (os.path.join() does this). I also think you should not remove files which have no original (i.e. which aren't duplicates even though their name looks like it).

Try this:

for file_name in file_list:
    if "(1)" not in file_name:
        continue
    original_file_name = file_name.replace('(1)', '')
    if not os.path.exists(os.path.join(file_path, original_file_name)):
        continue  # do not remove files which have no original
    os.remove(os.path.join(file_path, file_name))

Mind though, that this doesn't work properly for files which have multiple occurrences of (1) in their names, and files with (2) or higher numbers aren't handled at all. So my real proposal would be this:

Make a list of all files in the whole directory tree below a given start (use os.walk() to get this), then
sort all files by size, then
walk linearly through this list, identify the doubles (which are neighbours in this list) and
yield each such double-group (i. e. a small list of files (typically just two) which are identical).
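The steps above could be sketched roughly as follows. This is only a minimal sketch: the function name iter_same_size_groups is made up for illustration, and it groups by size only, so the groups it yields are candidates, not confirmed duplicates.

```python
import itertools
import os


def iter_same_size_groups(root_dir_path):
    """Yield lists of file paths below root_dir_path that share the same size.

    Hypothetical helper: equal size only makes files duplicate *candidates*.
    """
    # Step 1: collect (size, path) for every file in the directory tree.
    sized_paths = []
    for dir_path, _dir_names, file_names in os.walk(root_dir_path):
        for file_name in file_names:
            full_path = os.path.join(dir_path, file_name)
            sized_paths.append((os.path.getsize(full_path), full_path))

    # Step 2: sort by size, so files of equal size become neighbours.
    sized_paths.sort()

    # Steps 3 and 4: walk the sorted list once and yield each run of equal sizes.
    for _, entries in itertools.groupby(sized_paths, key=lambda entry: entry[0]):
        group = [path for _, path in entries]
        if len(group) > 1:  # a size that occurs only once cannot be a duplicate
            yield group
```

Calling list(iter_same_size_groups(r"C:\Data\temp")) would give one list per file size that occurs more than once.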

Of course you should then check the contents of these few files, to be sure that they are really identical and not just the same size by coincidence. If you are sure you have a group of identical ones, remove all but the one with the simplest name (e.g. without suffixes such as (1)).
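A content check and the choice of which file to keep might look like the sketch below. The names split_identical, choose_keeper and find_removable are hypothetical. The rule for "simplest name" (no "(n)" suffix before the extension, then the shortest path) is my assumption about what you would want. Nothing is deleted here; the function only returns the paths that could be removed.

```python
import filecmp
import os
import re

# Assumed pattern for a copy suffix such as "(1)" or "(12)" at the end of the stem.
COPY_SUFFIX = re.compile(r"\(\d+\)$")


def split_identical(paths):
    """Partition same-size paths into lists of byte-for-byte identical files."""
    groups = []
    for path in paths:
        for group in groups:
            # shallow=False forces a comparison of the actual file contents.
            if filecmp.cmp(group[0], path, shallow=False):
                group.append(path)
                break
        else:
            groups.append([path])
    return [group for group in groups if len(group) > 1]


def choose_keeper(paths):
    """Pick the path to keep: names without a copy suffix first, then shorter ones."""
    def sort_key(path):
        stem = os.path.splitext(os.path.basename(path))[0]
        return (bool(COPY_SUFFIX.search(stem)), len(path), path)
    return min(paths, key=sort_key)


def find_removable(paths):
    """Return the paths that could be deleted, leaving one keeper per identical group."""
    removable = []
    for group in split_identical(paths):
        keeper = choose_keeper(group)
        removable.extend(p for p in group if p != keeper)
    return removable
```

After reviewing the returned list, each entry can be passed to os.remove().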

By the way, I would call the file_path something like dir_path or root_dir_path (because it is a directory and a complete path to it).