Do md5sum on files inside a directory and check if there are identical files


I'm learning shell scripting, and an exercise asks me to calculate the MD5 hash of every file in a folder and, whenever two files have the same hash, print their names in the terminal. My code does that, but each match is printed twice, once in each order. I can't figure out how to exclude the first file name from the later iterations. One more constraint: creating temporary files to help with the task is forbidden.

#!/bin/bash

ifs=$IFS
IFS=$'\n'

echo "Verifying the files inside the directory..."

for file1 in $(find . -maxdepth 1 -type f | cut -d "/" -f2); do
  md51=$(md5sum $file1  | cut -d " " -f1)
  for file2 in $(find . -maxdepth 1 -type f | cut -d "/" -f2 | grep -v "$file1"); do
    md52=$(md5sum $file2 | cut -d " " -f1)
    if [ "$md51" == "$md52" ]; then
      echo "Files $file1 and $file2 are the same."
    fi
  done
done

I also would like to know if there is a more efficient way to do this task.


Solution

  • This computes every checksum once, then keeps only the lines whose hash occurs more than once:

    mapfile -t list < <(find . -maxdepth 1 -type f -exec md5sum {} + | sort)
    mapfile -t dups < <(printf "%s\n" "${list[@]}" | grep -f <(printf "^%s\n" "${list[@]}" | sed 's/ .*//' | sort | uniq -d))
    
    # the array dups now contains all duplicates along with their md5sums;
    # you can print the array with a simple
    printf "%s\n" "${dups[@]}"
    

    and you will get output like:

    3b0332e02daabf31651a5a0d81ba830a  ./f2.txt
    3b0332e02daabf31651a5a0d81ba830a  ./fff
    c9eb23b681c34412f6e6f3168e3990a4  ./both.txt
    c9eb23b681c34412f6e6f3168e3990a4  ./f_out
    d41d8cd98f00b204e9800998ecf8427e  ./aa
    d41d8cd98f00b204e9800998ecf8427e  ./abc def.xxx
    d41d8cd98f00b204e9800998ecf8427e  ./dudu
    d41d8cd98f00b204e9800998ecf8427e  ./start
    d41d8cd98f00b204e9800998ecf8427e  ./xx_yy
    

    The following addition is just for a fancier printout:

    echo "duplicates:"
    while read -r md5; do
            echo "$md5"
            printf "%s\n" "${dups[@]}" | grep "$md5" | sed 's/[^ ]* /  /'
    done < <(printf "%s\n" "${dups[@]}" | sed 's/ .*//' | sort -u)
    

    will print something like:

    3b0332e02daabf31651a5a0d81ba830a
       ./f2.txt
       ./fff
    c9eb23b681c34412f6e6f3168e3990a4
       ./both.txt
       ./f_out
    d41d8cd98f00b204e9800998ecf8427e
       ./aa
       ./abc def.xxx
       ./dudu
       ./start
       ./xx_yy
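
    For the pairwise message the question asks for, one option is to hash every file once and then compare only index pairs with i < j, so no pair is ever visited in both orders. This is a minimal sketch, not part of the original answer: report_pairs is a hypothetical helper name, and it assumes bash with GNU find and sort (for -print0 and sort -z). No temporary files are involved.

    ```shell
    #!/usr/bin/env bash
    # Sketch: hash each file once, then compare every unordered pair once (i < j).
    # report_pairs is a hypothetical helper name, not part of the original answer.

    report_pairs() {
        local dir=${1:-.}
        local -a files=() sums=()
        local file sum i j

        # NUL-delimited read keeps names with spaces or newlines intact.
        while IFS= read -r -d '' file; do
            # Reading from stdin avoids md5sum's file name escaping;
            # files that cannot be read are skipped.
            sum=$(md5sum < "$file") || continue
            files+=("${file##*/}")       # store the bare file name
            sums+=("${sum%% *}")         # keep only the hash field
        done < <(find "$dir" -maxdepth 1 -type f -print0 | sort -z)

        # Starting j at i + 1 means a pair is never reported in both orders.
        for (( i = 0; i < ${#files[@]}; i++ )); do
            for (( j = i + 1; j < ${#files[@]}; j++ )); do
                if [[ ${sums[i]} == "${sums[j]}" ]]; then
                    echo "Files ${files[i]} and ${files[j]} are the same."
                fi
            done
        done
    }
    ```

    Calling report_pairs . in the exercise directory prints one line per identical pair. With n files it runs md5sum n times, whereas the nested loops in the question run it roughly n² times.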
    

    Warning: the mapfile-based script above works only if the filenames don't contain the \n (newline) character. Making it fully general requires bash 4.4+, where mapfile supports the -d option.
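
    The generalisation the warning alludes to could look like the sketch below. It assumes bash 4.4+ for mapfile -d, plus GNU md5sum and sort versions that support -z (NUL-terminated records); list_duplicates is a hypothetical name. Instead of grep -f, it walks the sorted list and emits every entry whose hash equals that of its neighbour.

    ```shell
    #!/usr/bin/env bash
    # Sketch: newline-safe duplicate listing using NUL-terminated records.
    # Assumes bash 4.4+ (mapfile -d) and GNU md5sum/sort with -z.
    # list_duplicates is a hypothetical helper name.

    list_duplicates() {
        local dir=${1:-.}
        local -a list=()
        local entry hash prev_hash='' prev_entry='' in_group=0

        # md5sum -z ends each record with NUL and does not escape file names;
        # sort -z puts records with the same hash next to each other.
        mapfile -d '' -t list < <(
            find "$dir" -maxdepth 1 -type f -exec md5sum -z {} + | sort -z
        )

        for entry in "${list[@]}"; do
            hash=${entry%% *}            # text before the first space is the hash
            if [[ $hash == "$prev_hash" ]]; then
                # First repeat of a hash: also emit the entry that opened the group.
                (( in_group )) || printf '%s\0' "$prev_entry"
                printf '%s\0' "$entry"
                in_group=1
            else
                in_group=0
            fi
            prev_hash=$hash
            prev_entry=$entry
        done
    }
    ```

    The function writes NUL-terminated records, so a caller can read them back with mapfile -d '' -t dups < <(list_duplicates .) and print them as before.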