Search code examples
pythondirectory-structuresubdirectoryscandir

Python 2.7 - Using scandir to traverse all sub-directories and return list


Using Python 2.7 and scandir, I need to traverse down all directories and subdirectories and return a list of the directories ONLY. Not the files. The depth of subdirectories in along a path may vary.

I am aware of os.walk, but my directory has 2 million files and therefore os.walk is to slow for this.

Currently the code below works for me, but I suspect there may be an easier way/loop to achieve the same result, and I'd like to know how it can be improved. Also the limitation of my function is that it is still limited by the depth I can traverse into the sub-directories, and perhaps this can be overcome.

def list_directories(path):
dir_list = []
for entry in scandir(path):
    if entry.is_dir():
        dir_list.append(entry.path)
        for entry2 in scandir(entry.path):
            if entry2.is_dir():
                dir_list.append(entry2.path)
                for entry3 in scandir(entry2.path):
                    if entry3.is_dir():
                        dir_list.append(entry3.path)
                        for entry4 in scandir(entry3.path):
                            if entry4.is_dir():
                                dir_list.append(entry4.path)
                                for entry5 in scandir(entry4.path):
                                    if entry5.is_dir():
                                        dir_list.append(entry5.path)
                                        for entry6 in scandir(entry5.path):
                                            if entry6.is_dir():
                                                dir_list.append(entry6.path)
return dir_list
for item in filelist_dir(directory):
    print item

Please let me know if you have a better alternative to quickly returning all directories and sub-directories in a path that has millions of files.


Solution

  • scandir supports a walk() function that includes the same optimizations of scandir() so it should be faster than os.walk(). (scandir's background section suggests a 3-10 time improvement on Linux/Mac OS X.)

    So you could just use that... Something like this code might work:

    from scandir import walk
    
    def list_directories(path):
        dir_list = []
        for root, _, _ in walk(path):
            # Skip the top-level directory, same as in your original code:
            if root == path:
                continue
            dir_list.append(root)
        return dir_list
    

    If you want to implement this using scandir() instead, in order to implement something that supports an arbitrary depth you should use recursion.

    Something like:

    from scandir import scandir
    
    def list_directories(path):
        dir_list = []
        for entry in scandir(path):
            if entry.is_dir() and not entry.is_symlink():
                dir_list.append(entry.path)
                dir_list.extend(list_directories(entry.path))
        return dir_list
    

    NOTE: I added a check for is_symlink() too, so it doesn't traverse symlinks. Otherwise a symlink pointing to '.' or '..' would make this recurse forever...

    I still think using scandir.walk() is better (simpler, more reliable), so if that suits you, use that instead!