Search code examples
pythondictionarylookupfile-search

Efficient design to store lookup table for files in directories


Let's say I have three directories dir1, dir2 & dir3, with thousands of files in each. Each file has a unique name with no pattern.

Now, given a filename, I need to find which of the three directories it's in. My first thought was to create a dictionary with the filename as key and the directory as the value, like this:

{'file1':'dir1', 
 'file2':'dir3',
 'file3':'dir1', ... }

But seeing as there are only three unique values, this seems a bit redundant and takes up space.

Is there a better way to implement this? What if I can compromise on space but need faster lookup?


Solution

  • A simple way to solve this is to query the file-system directly instead of caching all the filenames in a dict. This will save a lot of space, and will probably be fast enough if there only a few hundred directories to search.

    Here is a simple function that does that:

    def find_directory(filename, directories):
        for directory in directories:
            path = os.path.join(directory, filename)
            if os.path.exists(path):
                return directory
    

    On my Linux system, when searching around 170 directories, it takes about 0.3 seconds to do the first search, and then only about 0.002 seconds thereafter. This is because the OS does file-caching to speed up repeated searches. But note that if you used a dict to do this caching in Python, you'd still have to pay a similar initial cost.

    Of course, the subsequent dict lookups would be faster than querying the file-system directly. But do you really need that extra speed? To me, two thousandths of second seems easily "fast enough" for most purposes. And you get the extra benefit of never needing to refresh the file-cache (because the OS does it for you).

    PS:

    I should probably point out that the above timings are worst-case: that is, I dropped all the system file-caches first, and then searched for a filename that was in the last directory.