Search code examples
pythonalgorithmrecursiontime-complexityspace-complexity

Time and space complexity of a File Recursion algorithm


Is my time and space analysis correct for this algorithm?

def find_files(suffix, path):
if suffix == '':
    return []

# Base condition
if len(os.listdir(path)) == 0:
    return []

path_elements = os.listdir(path)
path_files = [file for file in path_elements if ('.' + suffix) in file]
path_folders = [folder for folder in path_elements if '.' not in folder]

for folder in path_folders:
    path_files.extend(find_files(suffix=suffix, path=path + '/' + folder))

return path_files

# Tests
path_base = os.getcwd() + '/testdir'
print(find_files(suffix='c', path=path_base))

Explanation: File Recursion

Requirement:

The goal is to write code for finding all files under a directory (and all directories beneath it) that end with ".c"

Summary:

In order to implement this, I have used find_files function which will take suffix (file extension) and path (directory path where we need to search). In this function, I am recursively searching for a file with a given extension in a parent directory and all its sub-directories. I am storing all these files with a given suffix in a list and returning it.

Time and Space Complexity:

Time Complexity:

os.listdir(path) will list all the files and folders in a given path. Since it has to iterate over each item to provide a list, it takes linear time complexity. Let e be the length of the path_elements => O(len(path_elements)) => O(e)

path_elements = os.listdir(path)
  1. Below line will find out all the files in a given path by checking if there is a .c in the file-name. If s is a length of the string (name of the file/folder), then it takes O(s) time to find out a file with the given extension. Therefore, for f number of files, it takes O(f*s) time. For practical purposes such as this, it is fair to assume that file-names are NOT infinitely long. Therefore, we can assume s as a constant => O(f * 1) => O(f).
path_files = [file for file in path_elements if ('.' + suffix) in file]
  1. Similarly, traversing through g directories will take linear time => O(g)
path_folders = [folder for folder in path_elements if '.' not in folder]

The above 3 steps will take => O(e + f + g) => O(e). Since, e (len(path_elements)) is a dominant term as it is a combination of path_files and path_folders

  • For recursion, if k is the number of directories (branches) at each level in depth then the recursion will take O(brancher^depth) => O(k^depth). But for every recursive call, above three steps are performed. Therefore, total time complexity is O((k^depth) * e

Space Complexity:

  1. Input Space

    • suffix - O(1)
    • path - O(1)
    • path_elements + path_files + path_folders => O(e + f + g) => O(e)
  2. Auxiliary Space

    Recursion uses something called as "call stack". When a function calls itself, that function goes on top of the stack.

    • In this algorithm, the recursive function is exhausted only if it has traversed every directory. In other words, when it has reached the depth. Therefore O(depth) space is required in the call stack.
  3. Total Space Complexity => Input Space + Auxiliary Space => O(e + depth)


Solution

  • I think the complexity is O(path_elements), where path_elements is the total number of files and folders returned by all calls to os.listdir(). Consider a folder that has four files in a single directory:

    folder
      zip.c
      foo.c
      bar.c
      bas.c
    

    So your function calls os.listdir(), which returns four files. So n = 4. The code then iterates over the list to find folders, of which it finds none, and again to pick out the .c files, which it adds to the list.

    The complexity here is O(n) to call os.listdir(), O(n) to search for folders, and O(n) to pick out the .c files and add them to the list. Altogether O(3*n), which is O(n).

    Now consider this directory tree:

    folder
      sub1
        zip.c
        foo.c
      sub2
        bar.c
        bas.c
    

    So the first call to find_files calls os.listdir(folder), which returns two folders. Each folder makes a recursive call to find_files. Altogether there are three calls to os.listdir(), and each one returns two files, so the total number of items returned by os.listdir() is 6.

    Now imagine you had:

    folder
      sub0
        sub1
          sub2
            sub3
              ...
                sub30
                  file1.c
    

    The complexity here is still O(n). In this case, n is 31.

    The point I'm trying to make here is that you look at each item returned from os.listdir() a constant number of times. Thus, O(n).