I wrote a Python script to compare files in two directories, using filecmp.cmp. It works, but just now I tried running it on a collection of huge files, and it was very slow.
The documentation says that when the shallow parameter is true (which it is, by default), filecmp.cmp should only compare the os.stat results.
The script ran much faster for another big collection of jpg files. If it only checks os.stat, why does the file size have a larger effect than the number of files?
I think the documentation for the shallow parameter is misleading*. Passing shallow=True does not necessarily prevent the filecmp.cmp function from comparing the contents of the files. If your files are the same size but have different mtimes, their contents will still be checked.
You can see the implementation of cmp in your Python installation, or you can look at the (current as of this moment) source in the Python source repository.
Here is the relevant bit of cmp:
    def cmp(f1, f2, shallow=True):
        # long docstring removed
        s1 = _sig(os.stat(f1))
        s2 = _sig(os.stat(f2))
        if s1[0] != stat.S_IFREG or s2[0] != stat.S_IFREG:
            return False
        if shallow and s1 == s2:
            return True
        if s1[1] != s2[1]:
            return False
        # rest of function, which calls a helper to do the actual file contents comparisons
The _sig helper function returns a tuple of values extracted from the stat data structure for a file. The tuple values are the file type, the file size, and its mtime (usually the last time the file contents were modified).
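To make that tuple concrete, here is a minimal sketch of a function that builds the same kind of signature. The name signature is my own, chosen for illustration; it is meant to mirror what _sig extracts, not to replace the private helper.

```python
import os
import stat


def signature(path):
    """Return (file type, size in bytes, mtime) for path.

    Hypothetical helper that mirrors the tuple described above.
    """
    st = os.stat(path)
    # stat.S_IFMT masks the mode down to the file-type bits,
    # e.g. stat.S_IFREG for a regular file or stat.S_IFDIR for a directory.
    return (stat.S_IFMT(st.st_mode), st.st_size, st.st_mtime)
```

Two files with equal signatures are the ones that shallow=True treats as equal without reading them.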
The tests I included in the code excerpt try to quickly determine whether two files are the same based on those pieces of metadata. If either file is not a "regular" file (because it's a directory or a special system file), the files are considered unequal. Also, if they're not the same size, they cannot possibly be equal.
What the shallow parameter does is allow a quick positive test. If shallow is true and the files have the same size and mtime, filecmp.cmp will assume the files are equal.
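A small experiment shows both sides of that behavior. This is a sketch under my own assumptions: the function name demo_shallow, the file names, and the timestamps are made up, and the two files deliberately have the same size but different contents.

```python
import filecmp
import os
import tempfile


def demo_shallow():
    """Return the results of three filecmp.cmp calls on same-sized files."""
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "a.bin")
        b = os.path.join(tmp, "b.bin")
        with open(a, "wb") as fh:
            fh.write(b"A" * 1024)
        with open(b, "wb") as fh:
            fh.write(b"B" * 1024)

        # Same size and same mtime: the signatures match, so the
        # shallow shortcut reports "equal" without reading either file.
        os.utime(a, (1_000_000_000, 1_000_000_000))
        os.utime(b, (1_000_000_000, 1_000_000_000))
        same_signature = filecmp.cmp(a, b, shallow=True)

        # Same size, different mtime: the shortcut does not apply,
        # so the contents are read and compared even with shallow=True.
        os.utime(b, (1_000_000_100, 1_000_000_100))
        different_mtime = filecmp.cmp(a, b, shallow=True)

        # shallow=False always compares contents of same-sized files.
        deep = filecmp.cmp(a, b, shallow=False)
        return same_signature, different_mtime, deep


print(demo_shallow())  # (True, False, False)
```

The first result is a false positive, and the second call is the slow path: for huge files of identical size, every such pair gets read from disk.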
What I suspect is happening in your program is that your current directory has a number of files that are exactly the same size (perhaps because of very similar contents, or because the file size is fixed by the data format). Your previous data sets did not have as many same-sized files, so your code was able to quickly rule them out.
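One way to check that suspicion is to count how many pairs would fall through to a content comparison. The sketch below assumes your script pairs files by name across the two directories, which may not match what it actually does; classify_pairs and its category labels are hypothetical.

```python
import os
import stat
from collections import Counter


def classify_pairs(dir_a, dir_b):
    """Count how same-named files in two directories would be handled."""
    counts = Counter()
    common = set(os.listdir(dir_a)) & set(os.listdir(dir_b))
    for name in sorted(common):
        st_a = os.stat(os.path.join(dir_a, name))
        st_b = os.stat(os.path.join(dir_b, name))
        if not (stat.S_ISREG(st_a.st_mode) and stat.S_ISREG(st_b.st_mode)):
            counts["not regular"] += 1      # quick False
        elif st_a.st_size != st_b.st_size:
            counts["size differs"] += 1     # quick False
        elif st_a.st_mtime == st_b.st_mtime:
            counts["same signature"] += 1   # quick True when shallow is true
        else:
            counts["content read"] += 1     # same size, different mtime
    return counts
```

If "content read" dominates for the slow collection and is rare for the jpg collection, that would explain why file size matters more than file count.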
* I think filecmp.cmp's docstring is so misleading that it indicates a bug (either because it doesn't describe the behavior properly, or because the actual implementation is incorrect and should be fixed to match the docs). And it looks like I'm not alone. Here is a bug report on this issue, though it hasn't been updated in several years. I'll ping the bug with a link to this question and maybe somebody will work on fixing it!