
How to diff md5 sums of two filesystem states?


I'm collecting md5sum snapshots of the same filesystem at two different points in time (i.e., before and after an infection). I need to diff these two states to see which files changed between these two points in time.

To collect these states I might do the following (on macOS with SIP turned off):

sudo gfind / ! -path '*/dev/*' ! -path '*/Network/*' ! -path '*/Volumes/*' ! -path '*/.fseventsd/*' ! -path '*/.Spotlight-V100/*' -type f -exec md5sum {} \; > $(date "+%y%m%d%H%M%S").system_listing

The problem I'm having is that the resulting files are around 100 MB apiece, and diff by itself seems to compare chunks of lines rather than each individual file's md5sum in the output.

Is there an efficient way to do this with diff tools, or is it necessary to write a script that compares the two files by file path, effectively recreating diff with the path as the unique key, and then reports results based on the associated md5sum?


Solution

  • The order in which directories appear in the listing can produce a lot of diff noise.
    For example, I ran the following two commands over two directories full of PDFs,
    one holding 1 file and the other tens of files. Merely swapping the directory order produces 2 diff lines,
    whereas we want diff to report that nothing changed.

    find books/ docs-pdf/ -type f -exec md5sum {} \; > snapshot1
    find docs-pdf/ books/ -type f -exec md5sum {} \; > snapshot2
    
    diff snapshot1 snapshot2
    --- snapshot1
    +++ snapshot2
    @@ -1,4 +1,3 @@
    -83322cb1aaa94f9c8e87925f9d2a695e  books/ModSimPy.pdf
     192e5d38e59d8295ec9ca715e784a6d0  docs-pdf/c-api.pdf
     76c5bfb41bc6e5f9c8da1ab1f915e622  docs-pdf/distributing.pdf
     0a630ec314653c68153f5bbc4446660c  docs-pdf/extending.pdf
    @@ -25,3 +24,4 @@
     31e3dc3f78a12c59cdc0426d8e75ec99  docs-pdf/tutorial.pdf
     4c59e969009b6c3372804efdfc99e2d9  docs-pdf/using.pdf
     cf5330f4ed5ca5f63f300ccfa3057825  docs-pdf/whatsnew.pdf
    +83322cb1aaa94f9c8e87925f9d2a695e  books/ModSimPy.pdf
    
    
    

    After sorting both snapshots by the 2nd column (the path), diff correctly reports no differences:

    sort -k2 snapshot1 > sorted.snapshot1
    sort -k2 snapshot2 > sorted.snapshot2
    diff sorted.snapshot1 sorted.snapshot2
    
    

    If this does not remove all of the noisy diff output, please post pieces of the example output you do not want.