Search code examples
pythonfileio

how to check if a file exists based on the files in a directory and indicate which are missing?


how do I check which files are missing from the directory based on a txt of the files I should have?

E.G this is the list of files I should have

A
B
C
D
E
F
G
H
I

But in my directory I only have

A.npy
B.npy
C.npy
D.npy

So I want to do a script that can produce a result.txt like this:

A [exists]
B [exists]
C [exists]
D [exists]
E [does not exist]
F [does not exist]
G [does not exist]
H [does not exist]
I [does not exist]

This is the script I have currently but it doesn't seem to work as it registers all files as "does not exist" :(

import os
import copy
import pandas as pd
import shutil
from pathlib import Path



# read training files.txt 
path_to_file = 'xxxxxxxxxxxxxxxxxx/train_files_CS/all_training_CSmaster.txt'
path = 'xxxxxxxxxxx/train_files_CS'

# list of training npy files in directory
lof = []
for (dirpath, dirnames, filenames) in os.walk(path):
  lof.append(filenames)

lof = [x[:len(x) - 4] for x in lof[0] if x[0] == 'P']
#print(lof)

# new file to be written into
f = open('check_training.txt', 'w')

existing_files = 0
missing_files = 0

trfiles = []
with open(path_to_file) as file:
    for line in file:
        #print(line.rstrip())
        trfiles.append(line)
        
for x in trfiles:    
    if x in lof:
        existing_files+=1
        f.write(x)
        f.write("...[exists] \n")
    else:
        missing_files+=1
        f.write(x)
        f.write("  ...[doesn't exist] \n")
            
f.close()

print("\nthe missing files are:", missing_files,"\n")
print("the existing files are:",existing_files,"\n")

Any help is appreciated, thank you! :)


Solution

  • Your program works for me after fixing the following two issues:

    Issue 1

    lof = [x[:len(x) - 4] for x in lof[0] if x[0] == 'P']
    

    I don't think you want to only list files that start with the letter 'P'. Perhaps you left this in by mistake after doing some debugging or something. To get all file names remove the if x[0] == 'P' part:

    lof = [x[:len(x) - 4] for x in lof[0]]
    

    Issue 2

    with open(path_to_file) as file:
        for line in file:
            #print(line.rstrip())
            trfiles.append(line)
    

    This doesn't remove the line break characters, so you end up with ['a\n', b\n', etc.]` whose elements don't match in the comparisons in the next step. Use this:

    with open(path_to_file) as file:
        trfiles = file.read().splitlines()
    

    With these two changes you should find you get the expected output.

    Other tips

    There are quite a few places where you can make your code more concise and readable by using list comprehensions instead of for loops. E.g.

    lof = []
    for (dirpath, dirnames, filenames) in os.walk(path):
         lof.append(filenames)
    

    Can be:

    lof = [filenames for (dirpath, dirnames, filenames) in os.walk(path)]
    

    Also, x[:len(x) - 4] is not very robust for removing the extension from filenames (as you can have files with 4 letters like .html, .docx, etc.). Use the os library function for splitting extensions:

    lof = [os.path.splitext(x)[0] for x in lof[0]]