Search code examples
pythonfileglob

Python - Check for exact string in file name


I have a folder where each file is named after a number (i.e. img 1, img 2, img-3, 4-img, etc). I want to get files by exact string (so if I enter '4' as an input, it should only return files with '4' and not any files containing '14' or 40', for example. My problem is that the program returns all files as long as it matches the string. Note, the numbers aren't always in the same spot (for same files its at the end, for others it's in the middle)

For instance, if my folder has the files ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4', 'ep.4.', 'ep.4 ', 'ep. 4. ',ep4xxx, 'ep 4 ', '404ep'],and I want only files with the exact number 4 in them, then I would only want to return ['ep 4', 'img4', '4xxx','file 4.mp4','ep.4.','ep.4 ', 'ep. 4. ',ep4xxx,'ep 4 ','404ep]

here is what I have (in this case I only want to return all mp4 file type)

for (root, dirs, file) in os.walk(source_folder):
    for f in file:
        if '.mp4' and ('4') in f:
            print(f)

Tried == instead of in


Solution

  • Judging by your inputs, your desired regular expression needs to meet the following criteria:

    1. Match the number provided, exactly
    2. Ignore number matches in the file extension, if present
    3. Handle file names that include spaces

    I think this will meet all these requirements:

    def generate(n):
        return re.compile(r'^[^.\d]*' + str(n) + r'[^.\d]*(\..*)?$')
    
    def check_files(n, files):
        regex = generate(n)
        return [f for f in files if regex.fullmatch(f)]
    

    Usage:

    >>> check_files(4, ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4'])
    ['ep 4', 'img4', '4xxx', 'file 4.mp4']
    

    Note that this solution involves creating a Pattern object and using that object to check each file. This strategy offers a performance benefit over calling re.fullmatch with the pattern and filename directly, as the pattern does not have to be compiled for each call.

    This solution does have one drawback: it assumes that filenames are formatted as name.extension and that the value you're searching for is in the name part. Because of the greedy nature of regular expressions, if you allow for file names with . then you won't be able to exclude extensions from the search. Ergo, modifying this to match ep.4 would also cause it to match file.mp4. That being said, there is a workaround for this, which is to strip extensions from the file name before doing the match:

    def generate(n):
        return re.compile(r'^[^\d]*' + str(n) + r'[^\d]*$')
    
    def strip_extension(f):
        return f.removesuffix('.mp4')
    
    def check_files(n, files):
        regex = generate(n)
        return [f for f in files if regex.fullmatch(strip_extension(f))]
    

    Note that this solution now includes the . in the match condition and does not exclude an extension. Instead, it relies on preprocessing (the strip_extension function) to remove any file extensions from the filename before matching.

    As an addendum, occasionally you'll get files have the number prefixed with zeroes (ex. 004, 0001, etc.). You can modify the regular expression to handle this case as well:

    def generate(n):
        return re.compile(r'^[^\d]*0*' + str(n) + r'[^\d]*$')