Tags: python, file, operating-system, glob, pathlib

Deleting files from a directory using a CSV file in Python


I want to delete files from multiple folders using a .csv file. The CSV file lists the files that need to be deleted (example: Box4, 60012-01). The data is stored across multiple folders, and the actual file names carry an additional suffix and extension (example: /tiles_20X_299/Box20/660491-3_mag20_xpos5980_ypos6279.jpg). Is there a way to delete these files? Any help would be really appreciated. This is what I have so far, but I'm not sure I'm going in the right direction. [sample of the csv file to delete][1]

```python
import glob
import os

fin = open('files_to_delete.csv', 'r')
fin.readline()
print(fin)
file_to_delete = set()
while True:
    line = fin.readline().strip()
    #print(line)
    if not line:
        break
    array = line.split(',')
    file_to_delete.add("Box" + array[0] + "/" + array[1])
fin.close()
print(file_to_delete)
#
for path in glob.glob('/home/sshah/Tiles/tiles_20X_299/*'):
    for f in file_to_delete:
        print(f)
        os.chdir(path)
        #print(path)
        if os.path.exists(f):
            print('delete')
            #os.remove(f)
```


  [1]: https://i.sstatic.net/dFCxk.png

Solution

  • You're definitely going in the right direction.

    Assuming you're running Python 3.5 or later, you can use glob.iglob() with a ** pattern and recursive=True to iterate over every file in every subdirectory.
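To make the recursive behaviour concrete, here is a minimal sketch that builds a throwaway directory tree and walks it with glob.iglob(). The folder and tile names are made up for illustration, and list_files_recursively is a hypothetical helper, not part of the answer's code. One thing worth noticing: a `**/*` pattern also yields directories, so the sketch filters them out with os.path.isfile().

```python
import glob
import os
import tempfile


def list_files_recursively(root):
    """Return every regular file below root, found via a recursive glob."""
    pattern = os.path.join(root, '**', '*')
    # '**' only descends into subdirectories when recursive=True is passed.
    # The pattern matches directories as well, so keep regular files only.
    return sorted(p for p in glob.iglob(pattern, recursive=True)
                  if os.path.isfile(p))


with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout mirroring the question: one folder per box.
    for rel in ('Box4/60012-01_mag20_xpos1_ypos2.jpg',
                'Box20/660491-3_mag20_xpos5980_ypos6279.jpg'):
        path = os.path.join(root, *rel.split('/'))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, 'w').close()

    found = [os.path.relpath(p, root) for p in list_files_recursively(root)]
    print(found)
```

The two Box folders themselves are matched by the pattern but dropped by the filter, so only the two .jpg files are listed.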

    I've tweaked your code to make it a bit more pythonic.

    Some specific changes:

    • Renamed the file_to_delete set to files_to_delete because it contains multiple files and should be plural.

    • Used a with statement with the file object's context manager to avoid worrying about exceptions and explicitly calling .close().

    • Looped over fin to get each line without explicitly calling .readline().

    • Used os.path.sep instead of hardcoding /.

    • Removed the unnecessary os.chdir(path) and os.path.exists(f) calls.
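As an alternative to splitting lines by hand, the standard csv module could do the parsing. The sketch below is only an illustration: the CSV itself is shown only as a screenshot, so the column layout (box number first, slide name second, one header row) is an assumption inferred from the question's code, and read_entries is a hypothetical helper name.

```python
import csv
import io
import os


def read_entries(fin):
    """Build a set of 'Box<N>/<slide>' fragments from an open CSV file."""
    entries = set()
    reader = csv.reader(fin)
    next(reader, None)  # Skip the header row (assumed to be present)
    for row in reader:
        if len(row) < 2:
            continue  # Ignore blank or incomplete rows
        # Assumed layout: column 0 is the box number, column 1 the slide name.
        box, slide = row[0].strip(), row[1].strip()
        if box and slide:
            entries.add('Box' + box + os.path.sep + slide)
    return entries


# Made-up sample standing in for files_to_delete.csv.
sample = io.StringIO('Box,Slide\n4,60012-01\n20, 660491-3\n\n')
entries = read_entries(sample)
print(entries)
```

With a real file, the same function could be called as `with open('files_to_delete.csv', newline='') as fin: entries = read_entries(fin)`. Stripping each field also tolerates a space after the comma.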

    It works by iterating over every file in every subdirectory (which gives us the full file path as a str), then looping over the files_to_delete set to check whether any entry is a substring of that path. If one is, the file is deleted, and we break out of the inner loop to continue with the next file path.

    If you know that each entry matches only one file, you can also uncomment this line: files_to_delete.remove(file_to_delete). That is the case if, for example, you have a file called:

    /tiles_20X_299/Box20/660491-3_mag20_xpos5980_ypos6279.jpg

    but not another one with the same base, such as:

    /tiles_20X_299/Box20/660491-3_mag10_xpos2000_ypos4000.jpg

    If several files can share a base, removing the entry after the first match would leave the others undeleted. To be safe, leave the line commented out.
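The effect of that optional line can be checked without touching the file system. The following sketch runs the same substring matching over the two example paths above, once keeping the entry and once removing it after the first match. matching_paths and its remove_after_match flag are hypothetical names used only for this illustration.

```python
def matching_paths(filepaths, entries, remove_after_match):
    """Return the paths that contain any entry as a substring."""
    remaining = set(entries)  # Work on a copy so the caller's set is untouched
    matched = []
    for filepath in filepaths:
        for entry in remaining:
            if entry in filepath:
                matched.append(filepath)
                if remove_after_match:
                    # Changing the set is fine here only because we
                    # break out of the loop immediately afterwards.
                    remaining.remove(entry)
                break
    return matched


paths = [
    '/tiles_20X_299/Box20/660491-3_mag20_xpos5980_ypos6279.jpg',
    '/tiles_20X_299/Box20/660491-3_mag10_xpos2000_ypos4000.jpg',
]
entries = {'Box20/660491-3'}

kept = matching_paths(paths, entries, remove_after_match=False)
removed = matching_paths(paths, entries, remove_after_match=True)
print(len(kept), len(removed))
```

With the entry kept, both tiles are matched. With the entry removed after the first hit, the second tile is skipped, which is why the line is only safe when each entry corresponds to a single file.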

    import glob, os
    
    files_to_delete = set()
    
    with open('files_to_delete.csv', 'r') as fin:
        fin.readline() # Consume header
        for line in fin:
            line = line.strip()
            if line:
                files_to_delete.add('Box' + line.replace(',', os.path.sep)) # Assume none of the files contain a comma
    
    print(files_to_delete)
    
    for filepath in glob.iglob(r'/home/sshah/Tiles/tiles_20X_299/**/*', recursive=True):
        for file_to_delete in files_to_delete:
            if file_to_delete in filepath:
                print('Delete:', filepath)
                #os.remove(filepath) # Uncomment to actually delete the file
                #files_to_delete.remove(file_to_delete) # Only if each entry matches a single file
                break
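
Since the question is also tagged pathlib, here is a hedged variant using Path.rglob(). It swaps the substring test for a stricter one: the parent folder must equal the box name, and the part of the file name before the first underscore must equal the slide name. That naming scheme is an assumption based on the single example path in the question, and delete_tiles with its dry_run flag is a hypothetical helper, not the code above. The stricter test avoids an entry such as 660491-3 also matching a tile from a slide named 660491-30.

```python
import tempfile
from pathlib import Path


def delete_tiles(root, entries, dry_run=True):
    """Delete files whose (folder, slide) pair appears in entries.

    entries is a set of tuples such as ('Box20', '660491-3').
    Assumes file names look like '<slide>_<anything>'.
    """
    # Collect the matches first, so nothing is deleted while scanning.
    matched = sorted(
        p for p in Path(root).rglob('*')
        if p.is_file() and (p.parent.name, p.name.split('_', 1)[0]) in entries
    )
    for path in matched:
        if dry_run:
            print('Would delete:', path)
        else:
            path.unlink()
    return matched


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Made-up tiles; the second one has a similar slide name and must survive.
    names = ['Box20/660491-3_mag20_xpos5980_ypos6279.jpg',
             'Box20/660491-30_mag20_xpos1_ypos1.jpg',
             'Box4/60012-01_mag20_xpos1_ypos2.jpg']
    for name in names:
        tile = root / name
        tile.parent.mkdir(parents=True, exist_ok=True)
        tile.touch()

    deleted = delete_tiles(root, {('Box20', '660491-3')}, dry_run=False)
    deleted_names = [p.name for p in deleted]
    survivors = sorted(p.name for p in root.rglob('*') if p.is_file())
    print(deleted_names, survivors)
```

Leaving dry_run at its default of True only prints the candidates, which mirrors the commented-out os.remove() call in the answer's code.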