Tags: bash, rsync, md5sum

Compare md5 of all files in directory excluding multiple hardlinks


I tend to ramble, so I apologise in advance if a bid to cut the chaff leads to less context (or I just fail miserably and ramble nonetheless).

I'm trying to improve some tools I wrote for rsyncing a large amount of data from one network storage location to another for archiving purposes (the 2nd network location is part of a much larger tape library system). Due to a large number of shared assets, there are usually many hard-linked files in the directories to move, and I use rsync to preserve those links.
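For context, the link preservation itself comes down to rsync's -H / --hard-links option; a minimal sketch with placeholder paths (my real invocation has more options than this):

# -a for the usual archive behaviour, -H to recreate hard links at the destination
rsync -aH /mnt/source/project/ /mnt/archive/project/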

Rsyncing in the region of 1TB of actual data that, once hard links are counted towards the total, can be 4 or 5 times bigger (i.e. 4-5TB) is not uncommon, or unexpected.

For various reasons, I need to hash the data in the source, compare it to the destination data, AND keep a record of the hash results (including the hashes themselves). This is so that if restored data is unexpectedly corrupt, I can compare the hash of the restored data with the hash of the same file when it was originally rsynced, to pinpoint when / if the corruption occurred.

After the rsync has happened, I use the following to md5 the source (any hash would do, but I chose md5 for no specific reason):

find . -type f -exec md5sum "{}" + > "$temp_file"

The contents of $temp_file are echoed into my main output file as well. Then I move to the destination and run the following (it's done that way, source first then destination, so that if folders are being merged, only the files moved in this latest rsync get hashed):

md5sum -c "$temp_file" >> "$output_file"

All is well and good, and this does work, EXCEPT that it hashes all the files, including hard links, in effect computing the md5 of the same data over and over again, which can add hours to the process overall.
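As a toy illustration of the waste (hypothetical sandbox; actual hash value elided):

mkdir /tmp/demo && cd /tmp/demo
echo payload > a
ln a b                            # b is a hard link to a: two names, one inode
find . -type f -exec md5sum "{}" +
# <hash>  ./a
# <hash>  ./b   <- the same data read and hashed a second time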

Is there a way to edit the 'find....' command to ignore hard-linked files, BUT still hash the 'original' file to which the hard links actually point? I did look into the following:

find . -type f -links 1

But my concern is that ALL hard-link related files will be ignored, rather than listing the 'original' file that actually occupies the inode and excluding just the files that subsequently point to that inode.

Am I right about -links 1 ignoring all hard-link related files, and if so, what can I do?


Solution

  • Unlike soft links, hard links are regular files; each name points to the same inode, and conceptually there is no 'original' or 'duplicate' hard link.

    What you can do here is use -samefile with the find command to get all the paths that share an inode, put them into an ignore list, and use that list to skip the operation on the duplicates, as in the sketch below.

    touch /tmp/duplicates
    find . -type f | while IFS= read -r f
    do
        # skip this path if it was already recorded as a duplicate
        if ! grep -qxF "$f" /tmp/duplicates
        then
            # record every *other* path sharing this file's inode
            find . -samefile "$f" | grep -vxF "$f" >> /tmp/duplicates
            # put md5sum procedure for $f here
        fi
    done
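
    As a side note on the -links 1 idea from the question: the concern is correct. A quick sandbox check (hypothetical paths) shows it excludes every name of a multiply-linked file, not just the 'extra' ones:

    mkdir /tmp/linktest && cd /tmp/linktest
    echo data > original
    ln original copy             # both names now have a link count of 2
    find . -type f -links 1      # prints nothing: both paths are excluded
    find . -type f -links +1     # prints ./original and ./copy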
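
    Also note that the nested find above rescans the whole tree once per unique file, which gets slow on large trees. If GNU find is available, a sketch of a faster variant of the same idea is to deduplicate on inode number up front ($temp_file as in the question; this breaks on filenames containing newlines):

    # prefix each path with its inode, keep only the first path seen per inode,
    # strip the inode prefix, then hash each surviving path once
    find . -type f -printf '%i %p\n' \
        | awk '!seen[$1]++ { sub(/^[0-9]+ /, ""); print }' \
        | while IFS= read -r f
          do
              md5sum "$f"
          done > "$temp_file"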