What am I missing here?
I am trying to print a list of missing files. As an example, I am using 'WorkSheet_X' and 'WorkSheet_Y' as my expected files, and I want to print the names of any of them that are missing from the folder.
For context, I am doing this to write code that automatically sends an e-mail when files are missing, with the missing file names in the body of the e-mail.
import glob
dir_to_search = r'G:\folder'
files_in_dir = glob.glob("{}{}".format(dir_to_search,'*.xls?'))
list_of_files = glob.glob('WorkSheet_X*','WorkSheet_Y*', recursive=True)
missing_files = [x for x in list_of_files if x not in files_in_dir]
print(missing_files)
Got error:
Traceback (most recent call last): ...
list_of_files = glob.glob('WorkSheet_X*','WorkSheet_Y*', recursive=True)
TypeError: glob() takes 1 positional argument but 2 positional arguments (and 1 keyword-only argument) were given
EDIT:
I need to search the files with a partial name 'WorkSheet_X*' because every day there is a different date after the 'X' in 'WorkSheet_X'.
You are comparing full path names of files that exist in different directories: as your code now stands, you search for the patterns WorkSheet_X* and WorkSheet_Y* in the current working directory, which would be different from dir_to_search (if it weren't, I am not sure what the point of this program is). Anyway, the code below allows the current working directory to be some directory other than dir_to_search. It splits the full path names of the files and compares only the file names, and it also attempts to make some optimizations (and corrections to your code):
import glob, itertools, os.path
dir_to_search = r'G:\folder'
# Create a set from the list of files to make searching more efficient but use only filename:
files_in_dir = {os.path.split(f)[1] for f in glob.iglob(os.path.join(dir_to_search, '*.xls?'))}
# glob.glob() accepts only one pattern per call, so use itertools.chain to
# combine two calls to glob.iglob. Rather than building an in-memory list,
# this builds a generator that returns the filenames as we need them, which
# is more efficient if there are a lot of files.
list_of_files = itertools.chain(glob.iglob('WorkSheet_X*'), glob.iglob('WorkSheet_Y*'))
# but we now must separate the file name from the full path specification:
missing_files = [f for f in list_of_files if os.path.split(f)[1] not in files_in_dir]
print(missing_files)
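On the TypeError itself: glob.glob() and glob.iglob() take a single pattern, so each pattern needs its own call. Here is a minimal sketch of that idea, run against a throwaway directory so it can be tried anywhere. The helper name iglob_many and the file names are made up for illustration and are not part of the original code.

```python
import glob
import itertools
import os
import tempfile


def iglob_many(directory, patterns):
    """Lazily yield paths in `directory` matching any of `patterns`.

    glob.iglob() takes exactly one pattern, so each pattern gets its
    own call and the resulting iterators are chained together.
    """
    return itertools.chain.from_iterable(
        glob.iglob(os.path.join(directory, pattern)) for pattern in patterns
    )


# Throwaway directory with hypothetical file names, for illustration only.
with tempfile.TemporaryDirectory() as source_dir:
    for name in ("WorkSheet_X_a.xlsx", "WorkSheet_Y_b.xlsx", "Other.xlsx"):
        open(os.path.join(source_dir, name), "w").close()

    # Consume the generator while the directory still exists.
    found = sorted(
        os.path.basename(path)
        for path in iglob_many(source_dir, ["WorkSheet_X*", "WorkSheet_Y*"])
    )

print(found)  # ['WorkSheet_X_a.xlsx', 'WorkSheet_Y_b.xlsx']
```

Passing the directory explicitly, rather than relying on the current working directory, also avoids the mismatch described above.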
If we are talking about a really large number of files matching the pattern '*.xls?' in the dir_to_search directory, it might be better not to create the files_in_dir set at all and instead do a directory lookup for each candidate file:
missing_files = [f for f in list_of_files if not os.path.isfile(os.path.join(dir_to_search, os.path.split(f)[1]))]
There is a subtle difference, however. Suppose we find a file named WorkSheet_X1.csv and it does exist in the dir_to_search directory. The first method will show it as missing because it does not match the pattern '*.xls?'. The second method, however, will not report it as missing, since it exists in the correct directory. Should the glob pattern being used really be 'WorkSheet_X*.xls?'?
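To make the difference concrete, here is a small sketch that runs both methods against two temporary directories. The function names missing_by_set and missing_by_lookup and the test files are hypothetical, chosen only for this illustration. It also tries the narrower candidate pattern 'WorkSheet_X*.xls?', under which the two methods agree in this example.

```python
import glob
import os
import tempfile


def missing_by_set(candidates, dir_to_search, pattern="*.xls?"):
    """First method: compare names against a set of files matching `pattern`."""
    present = {
        os.path.basename(p)
        for p in glob.iglob(os.path.join(dir_to_search, pattern))
    }
    return [f for f in candidates if os.path.basename(f) not in present]


def missing_by_lookup(candidates, dir_to_search):
    """Second method: one os.path.isfile() check per candidate."""
    return [
        f for f in candidates
        if not os.path.isfile(os.path.join(dir_to_search, os.path.basename(f)))
    ]


with tempfile.TemporaryDirectory() as source, tempfile.TemporaryDirectory() as target:
    # Hypothetical files: both exist in the source and in the target directory.
    for folder in (source, target):
        for name in ("WorkSheet_X1.csv", "WorkSheet_X2.xlsx"):
            open(os.path.join(folder, name), "w").close()

    wide = sorted(glob.glob(os.path.join(source, "WorkSheet_X*")))
    narrow = sorted(glob.glob(os.path.join(source, "WorkSheet_X*.xls?")))

    wide_set = [os.path.basename(f) for f in missing_by_set(wide, target)]
    wide_lookup = [os.path.basename(f) for f in missing_by_lookup(wide, target)]
    narrow_set = [os.path.basename(f) for f in missing_by_set(narrow, target)]
    narrow_lookup = [os.path.basename(f) for f in missing_by_lookup(narrow, target)]

print(wide_set)       # ['WorkSheet_X1.csv']  (present, but filtered out by '*.xls?')
print(wide_lookup)    # []
print(narrow_set)     # []
print(narrow_lookup)  # []
```

With the wide pattern, only the set-based method reports the .csv file, even though it is present. Narrowing the candidate pattern removes the discrepancy here because the .csv file is never considered in the first place.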