bash shell command-line md5sum

Find files with the same md5sum and print the matching ones on the same line


I'm trying to do what the title says. Here is an example.

Directory tree (A, B, C, D, F, G and H are my files):

dir0/
dir0/A    // same MD5sum as B
dir0/C
dir0/D    // same MD5sum as F and G
dir0/dir1/B    // same MD5sum as A
dir0/dir1/H
dir0/dir1/dir2/G    // same MD5sum as D and F
dir0/dir1/dir2/F    // same MD5sum as D and G

With this command:

find dir0/ -type f -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=prepend | awk '{ print $2 }'

I find every file in dir0 and its subdirectories, compute each file's MD5sum, sort the results, keep only the files whose MD5sum occurs more than once (split into groups, one group per MD5sum), and print only the file paths.
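The stages can be tried out on a throwaway copy of the example tree. This is only a sketch: it assumes GNU coreutils (`md5sum`, and `uniq` with `-w` and `--all-repeated`), and the file contents and the `tmp` and `dup_groups` variable names are made up for illustration.

```shell
# Build a temporary copy of the example tree (file contents are invented).
tmp=$(mktemp -d)
mkdir -p "$tmp/dir0/dir1/dir2"
printf 'one\n'   > "$tmp/dir0/A"
printf 'one\n'   > "$tmp/dir0/dir1/B"
printf 'two\n'   > "$tmp/dir0/C"
printf 'three\n' > "$tmp/dir0/D"
printf 'three\n' > "$tmp/dir0/dir1/dir2/F"
printf 'three\n' > "$tmp/dir0/dir1/dir2/G"
printf 'four\n'  > "$tmp/dir0/dir1/H"

# Run the pipeline from inside the temporary directory, one stage per line.
dup_groups=$(
  cd "$tmp" &&
  find dir0/ -type f -print0 |           # list regular files, NUL-separated
    xargs -0 md5sum |                    # one "<32-hex hash>  <path>" line per file
    sort |                               # lines with equal hashes become adjacent
    uniq -w32 --all-repeated=prepend |   # keep repeated hashes, blank line before each group
    awk '{ print $2 }'                   # drop the hash, keep the path
)
printf '%s\n' "$dup_groups"

rm -rf "$tmp"
```

The order of the groups depends on the hash values, so it may differ from the listing in this question. Note that `awk '{ print $2 }'` would truncate any path that contains whitespace.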

This works, and I get this output:

dir0/A        ]
dir0/dir1/B   ] first group

dir0/D             ]
dir0/dir1/dir2/F   ]
dir0/dir1/dir2/G   ] second group

How can I get the output in the following form instead? All files with the same MD5sum should be on one line (without the "first group", "second group" labels, of course):

dir0/A dir0/dir1/B  ] first group
dir0/D dir0/dir1/dir2/F dir0/dir1/dir2/G ] second group

Solution

  • The shortest way to do this would be to add a pipeline step like this:

    awk 'BEGIN{RS=RS RS}{$1=$1}1'
    

    RS = RS RS causes Awk to use "\n\n" as its record separator, thus reading each block as a single record. The FS field separator is whitespace, which includes newlines, so we don't have to do any work to split the lines.

    $1 = $1 doesn't really change the value of $1, but Awk thinks it could have, which means it'll reconstruct $0 (which currently has newlines in it) from $1, $2, etc., joining with OFS (which is " " by default).

    1 causes Awk to print $0 (and ORS, which is still a single newline) on every record.
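The step above can be checked in isolation on input shaped like the question's grouped output. This is a minimal sketch: the function names `join_groups`, `join_groups_paragraph` and `sample` are hypothetical. A multi-character `RS` is an extension found in awks such as gawk and mawk rather than something POSIX guarantees, so the second function shows a possible alternative using awk's paragraph mode (empty `RS`), which also treats blank-line-separated blocks as records.

```shell
# The answer's step: a two-newline RS makes each blank-line-separated block one record.
join_groups() {
  awk 'BEGIN{RS=RS RS}{$1=$1}1'
}

# Possible alternative (not from the answer): paragraph mode via an empty RS.
join_groups_paragraph() {
  awk -v RS= '{$1=$1}1'
}

# Sample input shaped like the question's output, including the leading
# blank line that --all-repeated=prepend puts before the first group.
sample() {
  printf '\ndir0/A\ndir0/dir1/B\n\ndir0/D\ndir0/dir1/dir2/F\ndir0/dir1/dir2/G\n'
}

sample | join_groups
sample | join_groups_paragraph
```

Each call should print the two groups on one line each. In the full pipeline, either function body would be appended as one more stage after `awk '{ print $2 }'`.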