Search code examples
loopswildcardglobsubdirectoryos.walk

Searching in multiple folders for specific file if it exists or not using Wildcard


I've searched through web and asked people about my "simple" problem, but I got no satisfactoy answer.

My problem is following:

I'm comparing data from Exhange (in form of excel files, that are saved in some folders) and data in my system (SQL query from database). I'm designing a tool that compares data FROM - TO specific DATE. All my exchange data filenames are based on specific date format, some string and excel file format varies (sometimes .xls, .xlsx, .xlsm).

Obviously what I need to do is to write a loop to search for needed files from "FROM" date to "TO" date. Let's take a period from 7th July 2020 to 13th July 2020. Let's say, that files for 11th July 2020 are missing. Bear in mind, that my files are stored in some location with multiple subfolders named by MONTH etc.

Example:

C:\Users\VB\Desktop\VB\python\05
C:\Users\VB\Desktop\VB\python\06
C:\Users\VB\Desktop\VB\python\07

Here are some examples of my filenames:

07.07.2020 - BestScore.xls
07.07.2020 - WorstScore.xlsx
08.07.2020 - BestScore.xls
08.07.2020 - WorstScore.xlsx
09.07.2020 - BestScore.xls
09.07.2020 - WorstScore.xls
10.07.2020 - BestScore.xls
10.07.2020 - WorstScore.xlsm
12.07.2020 - BestScore.xls
12.07.2020 - WorstScore.xlsx

My basic code looks like this:

import os
from datetime import timedelta

startD = date(2020,7,10)
day= timedelta(days=1)
EndD = date(2020,7,13)

folder = 'C:\Users\VB\Desktop\VB\python'

while startD <= EndD:
    
    date=(startD.strftime("%d.%m.%Y"))
    file = date + '-BestScore'
    file2 = date + '-Worstscore'

    **code IF file or file2 is found ---> do something **
    ** ELSE IF file or file2 is not found ---> print(file or file2 not found)

Problem occurs because I must use Wildcards and need to search through multiple folders (sometimes I need to comapare data for few months backward, so searching in different folder is a must).

I have tried to use different functions for looping through multiple folder:

  • os.walk()
  • glob.glob()
  • glob2.iglob()

but none of them works the way I want. While looping, these functions check each file for wildcard name and obviously return "else if" statement for above for EACH filename that is not exactly named:

no file for 20200713
no file for 20200713
no file for 20200713
no file for 20200713
no file for 20200713
no file for 20200713
no file for 20200713
no file for 20200713
no file for 20200713
no file for 20200713
no file for 20200713
no file for 20200713
no file for 20200713
no file for 20200713
no file for 20200713

I don't need to check each file if it is the right one, I just want to receive results like this:

found 07.07.2020 - BestScore.xls
found 07.07.2020 - WorstScore.xlsx
found 08.07.2020 - BestScore.xls
found 08.07.2020 - WorstScore.xlsx
found 09.07.2020 - BestScore.xls
found 09.07.2020 - WorstScore.xls
found 10.07.2020 - BestScore.xls
found 10.07.2020 - WorstScore.xlsm
NOT found 11.07.2020 - Bestscore
NOT found 11.07.2020 - Worstscore
found 12.07.2020 - BestScore.xls
found 12.07.2020 - WorstScore.xlsx

To sum up, I need a solution to search in multiple subfolders with Wildcard* and NOT to check every file with IF statement.

I'm learning python for few months and I think this should not be a great problem to solve, but I'm kinda confused about it. Solving this problem would complete my project, since everything else is already working :)

I would be very glad for any help.

Thanks.


Solution

  • Your question is rather vague: it would be nice if you could provide some more details on the file system structure..

    Regardless, I interpret you question as the following: Given a directory to search in and two dates (start and finish), you want to search for two different files (BestScore and WorstScore) within those dates. If they exist, do something, else, print warning.

    Example:

    • Start date: 07.07.2020
    • End state: 08.07.2020
    • Search dir: ~/some/dir
    • Allowed file-extensions: xls, xlsm, xlsx

    This means that we are looking for four files:

    • ~/some/dir/ ..... /07.07.2020 - BestScore.xls
    • ~/some/dir/ ..... /07.07.2020 - WorstScore.xlsx
    • ~/some/dir/ ..... /08.07.2020 - BestScore.xlsm
    • ~/some/dir/ ..... /08.07.2020 - WorstScore.xlsx

    .. and any of them may, or may not, exist within some sub directory of ~/some/dir. We are uninterested in any other files from other dates regardless of their existance or not.

    First, we'll need some helper functions. We start off with whatever you want to do to existing files, here represented by a print,

    def do_something_with(file_path):
      # do something with file ..
      print("doing something with '%s' .." % file_path)
    

    A function for determining if a directory entry is a file and of the correct file type,

    import os
    def is_xls_file(file_path):
      return (os.path.isfile(file_path) and (
        file_path.endswith(".xls") or
        file_path.endswith(".xlsm") or
        file_path.endswith(".xlsx")))
    

    A function for creating a dict of the files we are searching for,

    from datetime import date, timedelta
    def files_to_find(start_date, end_date, filenames):
      files = {}
      d = start_date
      while d != end_date:
        for fn in filenames:
          files["%s - %s" % (d.strftime("%d.%m.%Y"), fn)] = None
        d += timedelta(days=1)
      return files
    

    And then the actually search function: We perform a os.walk(), iterating over all files and subdirectories. If we find the files we're looking for, we store their paths in the files_to_finddictionary.

    def find_files(files_to_find, search_dir):
      for dirpath, subdirs, files in os.walk(search_dir):
        for f in files:
          for ftf in files_to_find:
            # add .lower() for case-insensitivity
            if ftf.lower() in f.lower() and is_xls_file(os.path.join(dirpath,f)):
              files_to_find[ftf] = os.path.join(dirpath,f)
      return files_to_find
    

    We may iterate over the file_to_find dictionary and perform whatever operations needed on the files that existed and print a warning for those that did not,

    startD = date(2020,7,10)
    EndD = date(2020,7,13)
    filenames = ["bestscore", "worstscore"] # the search is case-insensitve
    search_dir = "./fold1/fold2"
    
    to_find = files_to_find(startD, EndD, filenames)
    found = find_files(to_find, search_dir)
    
    for f,abs_path in found.items():
      if abs_path is None:
        print("Was unable to find '%s'" % f)
      else:
        do_something_with(abs_path)
    

    Below is a screen shot of an example run with results from the displayed file system, using the input from above. As explained earlier, the script will in this example be searching for six files (i.e. the Best- and WorstScore) for each date. Thus, it prints exactly six events, one for each file: did it find it, or not..

    example run