bash duplicates synology

How to find identical files in a directory tree


I would like to identify the identical files in a directory tree of a Synology NAS.
Is there a way to do it robustly and efficiently?

Here's what I tried:

basedir=/volume1/bordel

find "$basedir" -type f -exec md5sum {} + |
sort -k1,1 |
uniq -d

But I get no output, which is impossible.


Solution

  • If your uniq supports -D and -w options:

    find . -type f -exec md5sum {} + |
    sed 's/^\\\(.*\)/\1\\/'          |
    sort -k1,1                       |
    uniq -w32 -D                     |
    sed 's/\(.*\)\\$/\\\1/'
    

    The sed commands handle md5sum lines that begin with a backslash. Some versions of md5sum prefix a line with a backslash when the filename contains a newline or backslash character, and escape those characters in the filename as \n and \\. The first sed moves that leading backslash to the end of the line so that the hash starts in column 1 for sort and uniq; the second sed moves it back to the front.

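    The -w32 option of uniq compares only the first 32 characters of each line (the MD5 hash), and the -D option prints every line that belongs to a group of duplicates. Both options are GNU extensions. Your original command prints nothing because uniq -d compares whole lines, and every line also contains a different filename.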
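
If your uniq lacks -D or -w, which can happen with the minimal userland on some embedded systems, the grouping step can be done in awk instead. The following is only a sketch under that assumption: find_dupes is a hypothetical helper name, not something from the question, and it still relies on md5sum printing a 32-character hash at the start of each line. The two sed commands are the same as above.

```shell
# Hypothetical helper: list files under a directory that share an MD5 hash.
# Needs only find, md5sum, sed, sort and awk (no GNU uniq extensions).
# Usage: find_dupes /volume1/bordel
find_dupes() {
    find "${1:-.}" -type f -exec md5sum {} + |
    sed 's/^\\\(.*\)/\1\\/' |      # move a leading backslash to the end
    sort -k1,1 |                   # bring equal hashes next to each other
    awk '
        {
            hash = substr($0, 1, 32)           # first 32 chars = MD5 hash
            if (hash == prev_hash) {
                if (!printed) print prev_line  # first member of the group
                print                          # current member
                printed = 1
            } else {
                printed = 0                    # a new hash starts a new group
            }
            prev_hash = hash
            prev_line = $0
        }
    ' |
    sed 's/\(.*\)\\$/\\\1/'        # move the backslash back to the front
}
```

The awk program keeps the previous line in memory and prints it only once a second line with the same hash appears, which mirrors what uniq -w32 -D does on sorted input.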