Glob to print missing files list


What am I missing here?

I am trying to print a list of missing files. As an example, I am using 'WorkSheet_X' and 'WorkSheet_Y' as my expected files. If any of them are missing from the folder, I want to print their names.

For context, I am doing this to build a script that automatically sends an e-mail when files are missing, with the missing file names in the body of the e-mail.

import glob

dir_to_search = r'G:\folder'

files_in_dir = glob.glob("{}{}".format(dir_to_search,'*.xls?'))

list_of_files = glob.glob('WorkSheet_X*','WorkSheet_Y*', recursive=True)

missing_files = [x for x in list_of_files if x not in files_in_dir]

print(missing_files)

Got error:

Traceback (most recent call last): ...
list_of_files = glob.glob('WorkSheet_X*','WorkSheet_Y*', recursive=True)
TypeError: glob() takes 1 positional argument but 2 positional arguments (and 1 keyword-only argument) were given

EDIT:

I need to search the files with a partial name 'WorkSheet_X*' because every day there is a different date after the 'X' in 'WorkSheet_X'.


Solution

  • The TypeError occurs because glob.glob accepts only one pattern per call, so each pattern needs its own call. Beyond that, you are comparing path names of files found in different directories: as your code now stands, the patterns WorkSheet_X* and WorkSheet_Y* are searched for in the current working directory, which is presumably not dir_to_search (if it were, I am not sure what the point of this program would be). The code below therefore allows the current working directory to be some directory other than dir_to_search: it strips the directory part from each path and compares file names only. It also makes a few corrections and optimizations, such as joining the directory and the pattern with os.path.join so that the path separator is not lost:

    import glob, itertools, os.path
    
    dir_to_search = r'G:\folder'
    
    # Create a set from the files found to make searching more efficient, but keep only the file name:
    files_in_dir = {os.path.split(f)[1] for f in glob.iglob(os.path.join(dir_to_search, '*.xls?'))}
    # glob accepts only one pattern per call, so use itertools.chain to combine
    # one glob.iglob call per pattern. Rather than building an in-memory list, this
    # builds an iterator that returns the file names as we need them, which is more
    # efficient if there are a lot of files.
    list_of_files = itertools.chain(glob.iglob('WorkSheet_X*'), glob.iglob('WorkSheet_Y*'))
    # Compare on the file name only, separated from any directory part of the path:
    missing_files = [f for f in list_of_files if os.path.split(f)[1] not in files_in_dir]
    
    print(missing_files)
    

    If a really large number of files match the pattern '*.xls?' in the dir_to_search directory, it might be better not to create the files_in_dir set at all and instead do a directory lookup for each candidate file:

    missing_files = [f for f in list_of_files if not os.path.isfile(os.path.join(dir_to_search, os.path.split(f)[1]))]
    

    There is a subtle difference, however. Suppose we find a file named WorkSheet_X1.csv and it does exist in the dir_to_search directory. The first method will report it as missing because it does not match the pattern *.xls?. The second method will not report it as missing, because it exists in the correct directory. Should the glob pattern being used really be 'WorkSheet_X*.xls?'?
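If the real goal is the one described in the question's EDIT (the date after the 'X' changes every day, so only the prefix is known), another option is to skip the comparison between two directories and ask, for each expected pattern, whether anything in dir_to_search matches it. The following is only a minimal sketch under that assumption. The helper name missing_patterns, the demo file name and the use of a temporary directory in place of G:\folder are all made up for illustration:

```python
import glob
import os
import tempfile


def missing_patterns(directory, patterns):
    """Return the patterns from `patterns` that match no file in `directory`."""
    # glob.escape keeps any special characters in the directory name literal.
    base = glob.escape(directory)
    return [p for p in patterns if not glob.glob(os.path.join(base, p))]


# Demo in a throwaway directory standing in for G:\folder.
with tempfile.TemporaryDirectory() as demo_dir:
    # Only the "X" workbook has arrived today (hypothetical file name).
    open(os.path.join(demo_dir, "WorkSheet_X_2024-01-15.xlsx"), "w").close()

    expected = ["WorkSheet_X*.xls?", "WorkSheet_Y*.xls?"]
    print(missing_patterns(demo_dir, expected))  # ['WorkSheet_Y*.xls?']
```

The returned list contains patterns rather than concrete file names, since a file that never arrived has no name to report. That list could then be placed in the body of the notification e-mail.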