How to create md5sum for new files


We've created a folder on my dad's computer where everyone in the family can deposit and share their photos and videos.

Example of directories:
/Family_Photo/Penguins/2017 09 02/
/Family_Photo/East Beach/2017 10 11/Seaside/
/Family_Photo/East Beach/2017 10 11/Games/

Using md5deep, I am able to create a complete list of checksums for all the files in all subdirectories:

md5deep -r /Family_Photo/ > /Family_Photo/md5sum.log

Rather than regenerating the complete list of md5 checksums for all (newly added and existing) files every time, how can I write a bash script that automatically detects any files that have not been hashed yet, generates checksums for only those new files, and appends them to the original md5sum.log?


Solution

    This should do the trick:

    comm -1 -3 <(grep --text --perl-regex --only-matching '(?<=  ).+' /Family_Photo/md5sum.log | sort) <(find /Family_Photo -type f | sort) | xargs --delimiter='\n' --no-run-if-empty md5deep | tee -a /Family_Photo/md5sum.log

    Notes

    • If you use a different path than the one in the example, make sure it is an absolute, canonical path, or append the option -exec realpath {} \; to find. md5deep seems to write such paths into the file, and the paths in both lists need to be identical for the comparison to work.
    • This command line uses bash-specific syntax (process substitution, which passes a command's output as a file) and might not work in other shell interpreters.
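
Since the question asks for a script, the command could be wrapped roughly as follows. This is only a sketch and not part of the original answer: the function name hash_new_files, its two arguments, and the HASHER override are made up for illustration, and it assumes GNU grep, findutils, and coreutils are installed.

```shell
#!/usr/bin/env bash
# Sketch only: hash_new_files and HASHER are hypothetical names.
# Usage: hash_new_files DIR [LOGFILE]
hash_new_files() {
    local dir log
    dir=$(realpath "$1")           # canonical absolute path, so both lists match
    log=${2:-$dir/md5sum.log}      # default log location, as in the answer
    touch "$log"                   # first run: give grep an (empty) log to read
    # Note: a log stored inside DIR is itself listed by find, as in the original command.

    # --perl-regexp is the full spelling of the answer's --perl-regex.
    comm -1 -3 \
        <(grep --text --perl-regexp --only-matching '(?<=  ).+' "$log" | sort) \
        <(find "$dir" -type f | sort) |
        xargs --delimiter='\n' --no-run-if-empty "${HASHER:-md5deep}" |
        tee -a "$log"
}
```

Calling hash_new_files /Family_Photo should then behave like the command above. Running realpath on the directory up front is one way to satisfy the first note, since find prints paths that start with whatever directory it was given.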

    Explanation

    • comm -1 -3
      • We use this command in this specific case to see which files are new by comparing found files to the existing list.
      • comm compares two sorted lists and outputs which lines are unique to each and which are common to both
      • -1 means: don't show lines unique to first list
      • -3 means: don't show lines common to both files
      • as a result we only output lines unique to second list
    • <(grep --text --perl-regex --only-matching '(?<=  ).+' /Family_Photo/md5sum.log | sort) As the first file to comm we pass a list of the already hashed file names.
      • <(...) is bash syntax to pass the result of a program as file argument
      • With grep we extract the file names from the existing file by matching whatever follows double-space
      • --text makes sure md5sum.log is always considered a text file and not skipped
      • --perl-regex (short for GNU grep's --perl-regexp) use Perl regular expression syntax (we need this for look-behind matching)
      • --only-matching only output the text that matched the pattern, not the entire line containing the match
      • '(?<=  ).+' the matching pattern: (?<=  ) is a "look-behind" that checks whether the match is preceded by two spaces; it is followed by .+ (any characters, one or more)
      • | sort we pass the output of grep to sort, because comm expects sorted lists
    • <(find /Family_Photo -type f | sort) As second file to comm we pass all files we find
      • <(...) is bash syntax to pass the result of a program as file
      • find will recurse a given directory and print out all file names
      • -type f instructs find to only output the names of regular files, not directories
      • | sort we pass the output of find to sort, because comm expects sorted lists
    • | xargs --delimiter='\n' --no-run-if-empty md5deep The resulting list of new files is passed to md5deep
      • | connects the output of comm to the input of xargs
      • xargs will call a command (in this case md5deep) with whatever comes as input as argument
      • --delimiter='\n' specifies a newline as the separator, so that other whitespace in file names won't be mistaken for the start of a new argument
      • --no-run-if-empty we don't want to run md5deep if we don't have a single new filename to pass to it.
    • | tee -a /Family_Photo/md5sum.log (-a is the short form of --append) The resulting list of hashes is written to the hash file
      • This displays the new files/hashes for your convenience while writing them, if you don't want to see them, just use >> /Family_Photo/md5sum.log instead.
      • | connects the output of md5deep to the input of tee
      • tee will output its input and also write it to a file
      • --append tells tee to not overwrite file contents, but to append instead
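
To see the individual stages at work without touching any real files, they can be tried on a small sample. The log contents and file list below are made-up placeholders (the hash values are not real checksums of any photo), and printf stands in for md5deep in the last step.

```shell
#!/usr/bin/env bash
# Hypothetical sample: two files already in the log, three files "found" on disk.
existing_log='d41d8cd98f00b204e9800998ecf8427e  /Family_Photo/Penguins/2017 09 02/a.jpg
d41d8cd98f00b204e9800998ecf8427e  /Family_Photo/East Beach/2017 10 11/Games/b.jpg'

found_files='/Family_Photo/East Beach/2017 10 11/Games/b.jpg
/Family_Photo/East Beach/2017 10 11/Seaside/c.jpg
/Family_Photo/Penguins/2017 09 02/a.jpg'

# Stage 1: keep only what follows the two spaces after each hash.
# --perl-regexp is the full spelling of the answer's --perl-regex.
hashed_names() {
    printf '%s\n' "$existing_log" |
        grep --text --perl-regexp --only-matching '(?<=  ).+'
}

# Stage 2: lines unique to the second (found) list are the new files.
new_files=$(comm -1 -3 <(hashed_names | sort) <(printf '%s\n' "$found_files" | sort))
printf '%s\n' "$new_files"
# prints: /Family_Photo/East Beach/2017 10 11/Seaside/c.jpg

# Stage 3: with a newline delimiter, the path stays one argument despite its spaces.
quoted=$(printf '%s\n' "$new_files" |
    xargs --delimiter='\n' --no-run-if-empty printf '<%s>\n')
printf '%s\n' "$quoted"
# prints: </Family_Photo/East Beach/2017 10 11/Seaside/c.jpg>
```

Only c.jpg survives the comparison because it is the one path that appears in the found list but not in the log.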