python, pathlib

How can I recursively iterate through a directory in Python while ignoring some subdirectories?


I have a directory structure on my filesystem, like this:

folder_to_scan/
    important_file_a
    important_file_b
    important_folder_a/
        important_file_c
    important_folder_b/
        important_file_d
    useless_folder/
        ...

I want to recursively scan through folder_to_scan/, and get all the file names. At the same time, I want to ignore useless_folder/, and anything under it.

If I do something like this:

from pathlib import Path

path_to_search = Path("folder_to_scan")
[pth for pth in path_to_search.rglob("*") if pth.is_file() and 'useless_folder' not in [parent.name for parent in pth.parents]]

It will probably work (I didn't bother trying), but the problem is that useless_folder/ contains millions of files: rglob will still traverse all of them, which takes ages, and the filter is only applied while the final list is being built.

Is there a way to tell Python not to waste time traversing useless folders (useless_folder/ in my case)?


Solution

  • You can write your own recursive file iterator that returns early for unwanted directories, so their contents are never listed.

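    def useless(path):
        # Your logic to discard folders goes here:
        # return True for any directory that should be skipped.
        ...

    def my_files_iter(path):
        """Yield every file under path, skipping useless directories."""
        if path.is_file():
            yield path
        elif path.is_dir():
            if useless(path):
                # Return before iterdir(), so nothing below is traversed.
                return
            for child_path in path.iterdir():
                yield from my_files_iter(child_path)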
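
To make the idea concrete, here is a minimal sketch that fills in the skip test with a name lookup. The names `IGNORED_DIR_NAMES`, `iter_files` and `iter_files_walk` are made up for illustration and are not part of the answer above. The second function is a different technique shown for comparison: the standard library's `os.walk`, which lets you prune directories by editing its `dirnames` list in place. On a plain directory tree both should give the same files, though they can differ on symlinks (`os.walk` does not follow directory symlinks by default).

```python
import os
from pathlib import Path

# Hypothetical set of directory names to prune; adjust to your own layout.
IGNORED_DIR_NAMES = {"useless_folder"}


def iter_files(path, ignored=IGNORED_DIR_NAMES):
    """Yield files under `path`, never descending into ignored directories."""
    if path.is_file():
        yield path
    elif path.is_dir():
        if path.name in ignored:
            return  # prune: this directory's contents are never listed
        for child in path.iterdir():
            yield from iter_files(child, ignored)


def iter_files_walk(root, ignored=IGNORED_DIR_NAMES):
    """Alternative using os.walk, pruning by editing `dirnames` in place."""
    for dirpath, dirnames, filenames in os.walk(root):
        # With the default topdown=True, removing entries from `dirnames`
        # stops os.walk from descending into those directories.
        dirnames[:] = [d for d in dirnames if d not in ignored]
        for name in filenames:
            yield Path(dirpath) / name
```

Usage would look like `files = list(iter_files(Path("folder_to_scan")))`. Because both functions are generators, you can also loop over the results lazily instead of building the whole list.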