Why is filecmp.cmp slow for huge files even when its 'shallow' parameter is True?

I wrote a Python script to compare files in two directories, using filecmp.cmp. It works, but just now I tried running it for a collection of huge files. It was very slow.

The documentation says that when the shallow parameter is true (which it is, by default), filecmp.cmp should only compare the os.stat results.

The script ran much faster on another large collection of JPG files. If it only checks os.stat, why does file size have a bigger effect than the number of files?

Solution

I think the documentation for the shallow parameter is misleading*. Passing shallow = True does not necessarily prevent the filecmp.cmp function from comparing the contents of the files. If your files are the same size but have different mtimes, their contents will still be checked.

You can see the implementation of cmp in your Python installation, or you can look at the (current as of this moment) source in the Python source repository.

Here is the relevant bit of cmp:

def cmp(f1, f2, shallow=True):
    # long docstring removed

    s1 = _sig(os.stat(f1))
    s2 = _sig(os.stat(f2))
    if s1[0] != stat.S_IFREG or s2[0] != stat.S_IFREG:
        return False
    if shallow and s1 == s2:
        return True
    if s1[1] != s2[1]:
        return False

    # rest of function, which calls a helper to do the actual file contents comparisons

The _sig helper function returns a tuple of values extracted from the stat data structure for a file. The tuple values are the file type, the file size, and its mtime (usually the last time the file contents were modified).
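To make that concrete, here is a rough equivalent of the helper as I read it in the CPython source. The names sig and demo_sig are my own, not part of filecmp, and the private _sig may differ between Python versions, so treat this as a sketch rather than the library's definition.

```python
import os
import stat
import tempfile


def sig(st):
    # Same three fields described above: file type, size in bytes, mtime.
    return (stat.S_IFMT(st.st_mode), st.st_size, st.st_mtime)


def demo_sig():
    # Write a 5-byte temporary file and return its signature tuple.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.bin")
        with open(path, "wb") as f:
            f.write(b"hello")
        return sig(os.stat(path))


if __name__ == "__main__":
    file_type, size, mtime = demo_sig()
    print(file_type == stat.S_IFREG, size, mtime)
```

Note that nothing in the tuple depends on the file contents, which is why comparing two signatures is cheap no matter how large the files are.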

The tests I included in the code excerpt try to quickly determine if two files are the same based on those pieces of metadata. If either file is not a "regular" file (because it's a directory or a special system file), they're considered unequal. Also, if they're not the same size, they cannot possibly be equal.

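What the shallow parameter does is allow a quick positive test. If shallow is true and the files have the same size and mtime, filecmp.cmp will assume the files are equal without reading them. It does not add a quick negative test: if the sizes match but the mtimes differ, the function still goes on to compare the contents.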
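You can check this behavior yourself with a small experiment. This is only a sketch: the function name shallow_demo, the file names, and the timestamp are arbitrary choices of mine. It creates two files of equal size with different contents and uses os.utime to control their mtimes.

```python
import filecmp
import os
import tempfile


def shallow_demo():
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "a.bin")
        b = os.path.join(tmp, "b.bin")
        # Same size, different contents.
        with open(a, "wb") as f:
            f.write(b"AAAA")
        with open(b, "wb") as f:
            f.write(b"BBBB")

        # Give both files the same mtime (an arbitrary timestamp, in nanoseconds).
        stamp = 1_600_000_000 * 10**9
        os.utime(a, ns=(stamp, stamp))
        os.utime(b, ns=(stamp, stamp))

        # Identical type, size and mtime: the shallow shortcut reports "equal".
        shallow_same_mtime = filecmp.cmp(a, b, shallow=True)
        # shallow=False skips the shortcut and compares the bytes.
        deep = filecmp.cmp(a, b, shallow=False)

        # Move b's mtime by a minute so the signatures no longer match.
        later = stamp + 60 * 10**9
        os.utime(b, ns=(later, later))
        # Same size but different mtime: the contents are compared even though
        # shallow is true.
        shallow_diff_mtime = filecmp.cmp(a, b, shallow=True)

        return shallow_same_mtime, deep, shallow_diff_mtime


if __name__ == "__main__":
    print(shallow_demo())  # (True, False, False)
```

The first result shows the quick positive test at work (the files are reported equal although they differ), and the third shows that shallow=True did not stop the contents from being read.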

What I suspect is happening in your program is that the collection you are comparing now has a number of files that are exactly the same size but have different mtimes (perhaps because of very similar contents, or because the file size is fixed by the data format). Each such pair has its contents read and compared, which takes a long time for huge files. Your previous data sets did not have as many same-sized files, so your code was able to quickly rule them out.
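If you want to test that suspicion on your own data, something like the following could tell you which pairs get past the metadata checks. would_read_contents is a hypothetical helper of mine, not part of filecmp. It mirrors the checks in the excerpt above and ignores the module's internal result cache, so take its answer as an approximation.

```python
import os
import stat
import tempfile


def would_read_contents(path1, path2):
    """Guess whether filecmp.cmp(path1, path2, shallow=True) compares contents."""
    st1, st2 = os.stat(path1), os.stat(path2)
    # Non-regular files are reported unequal without any read.
    if not (stat.S_ISREG(st1.st_mode) and stat.S_ISREG(st2.st_mode)):
        return False
    # Different sizes are reported unequal without any read.
    if st1.st_size != st2.st_size:
        return False
    # Same size and same mtime: the shallow shortcut answers True, no read.
    # Same size but different mtime: the contents have to be compared.
    return st1.st_mtime != st2.st_mtime


def demo_would_read():
    with tempfile.TemporaryDirectory() as tmp:
        a, b, c = (os.path.join(tmp, name) for name in ("a.bin", "b.bin", "c.bin"))
        for path, data in ((a, b"AAAA"), (b, b"BBBB"), (c, b"CC")):
            with open(path, "wb") as f:
                f.write(data)
        stamp = 1_600_000_000 * 10**9  # arbitrary timestamp in nanoseconds
        later = stamp + 60 * 10**9
        os.utime(a, ns=(stamp, stamp))
        os.utime(b, ns=(later, later))  # same size as a, different mtime
        os.utime(c, ns=(stamp, stamp))  # different size from a
        return would_read_contents(a, b), would_read_contents(a, c)


if __name__ == "__main__":
    print(demo_would_read())  # (True, False)
```

Running a check like this over the pairs your script compares should show whether the slow collection really has many same-sized files with differing mtimes.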

* I think filecmp.cmp's docstring is so misleading that it indicates a bug (either because it doesn't describe the behavior properly, or because the actual implementation is incorrect and should be fixed to match the docs). And it looks like I'm not alone. Here is a bug report on this issue, though it hasn't been updated in several years. I'll ping the bug with a link to this question and maybe somebody will work on fixing it!