Time complexity of finding duplicate files in bash

I had to write a Bash script to delete duplicate files today, using their md5 hashes. I stored those hashes as files in a temporary directory:

for i in * ; do
    hash=$(md5sum /tmp/msg | cut -d " " -f1) ;
    if [ -f /tmp/hashes/$hash ] ;
    then
        echo "Deleted $i" ;
        mv $i /tmp/deleted ;
    else
        touch /tmp/hashes/$hash ;
    fi ;
done

It worked perfectly, but led me to wonder: is it a time-efficient way of doing that? I initially thought of storing the MD5 hashes in a file, but then I thought "no, because checking whether a given MD5 is in this file requires to re-read it entirely every time". Now, I wonder: is it the same when using the "create files in a directory" method? Has the Bash [ -f ] check linear, or quasi-constant complexity when there are lots of file in the same directory?

If it depends on the filesystem, what's the complexity on tmpfs?

Solution

I'll try to qualitatively answer how fast file existence tests are on tmpfs, and then I can suggest how you can make your whole program run faster.

First, tmpfs directory lookups rely (in the kernel) on directory entry cache hash table lookups, which aren't that sensitive to the number of files in your directory. They are affected, but sub-linearly. It has to do with the fact that properly-done hash table lookups take some constant time, O(1), regardless of the number of items in the hash table.

To explain, we can look at the work that is done by test -f, or [ -f X ], from coreutils (gitweb):

case 'e':
   unary_advance ();
   return stat (argv[pos - 1], &stat_buf) == 0;
... 
case 'f':                   /* File is a file? */
   unary_advance ();
   /* Under POSIX, -f is true if the given file exists
      and is a regular file. */
   return (stat (argv[pos - 1], &stat_buf) == 0
           && S_ISREG (stat_buf.st_mode));

So it uses stat() on the filename directly. No directory listing is done explicitly by test, but the runtime of stat may be affected by the number of files in the directory. The completion time for the stat call will depend on the unterlying filesystem implementation.

For every filesystem, stat will split up the path into directory components, and walk it down. For instance, for the path /tmp/hashes/the_md5: first /, gets its inode, then looks up tmp inside it, gets that inode (it's a new mountpoint), then gets hashes inode, and finally then the test filename and its inode. You can expect the inodes all the way to /tmp/hashes/ to be cached because they are repeated at each iteration, so those lookups are fast and likely don't require disk access. Each lookup will depend on the filesystem the parent directory is on. After the /tmp/ portion, lookups happen on tmpfs (which is all in memory, except if you ever run out of memory and need to use swap).

tmpfs in linux relies on simple_lookup to obtain the inode of a file in a directory. tmpfs is located under its old name in the tree linux mm/shmem.c . tmpfs, much like ramfs, doesn't seem to be implementing data structures of its own to keep track of virtual data, it simply relies on VFS directory entry caches (under Directory Entry Caches).

Therefore, I suspect the lookup for a file's inode, in a directory, is as simple as a hash table lookup. I'd say that as long as all your temporary files fit in your memory, and you use tmpfs/ramfs, it doesn't matter how many files are there -- it's O(1) lookup each time.

Other filesystems like Ext2/3, however, will incur a penalty linear with the number of files present in the directory.

storing them in memory

As others have suggested, you may also store MD5s in memory by storing them in bash variables, and avoid the filesystem (and associated syscall) penalties. Storing them on a filesystem has the advantage that you could resume from where you left if you were to interrupt your loop (your md5 could be a symlink to the file whose digest matches, which you could rely on, on subsequent runs), but is slower.

MD5=d41d8cd98f00b204e9800998ecf8427e
let SEEN_${MD5}=1
...
digest=$(md5hash_of <filename>)
let exists=SEEN_$digest
if [[ "$exists" == 1 ]]; then
   # already seen this file
fi

faster tests

And you may use [[ -f my_file ]] instead of [ -f my_file ]. The command [[ is a bash built-in, and is much faster than spawning a new process (/usr/bin/[) for each comparison. It will make an even bigger difference.

what is /usr/bin/[

/usr/bin/test and /usr/bin/[ are two different programs, but the source code for [ (lbracket.c) is the same as test.c (again in coreutils):

#define LBRACKET 1
#include "test.c"

so they're interchangeable.