Tags: bash, hash, echo, processing-efficiency

Most efficient way to hash each line of a text file?


I'm currently writing a Bash script which hashes each line of a text file and outputs it into a new file with the format hash:originalword. The script I have at the moment to do this is:

cat "$originalfile" | while read -r line; do
    hash="$(printf %s "$line" | $hashfunction | cut -f1 -d' ')"
    echo "$hash:$line" >> "$outputlocation"
done

I originally got the code for this from a very similar question linked here. The script works exactly as advertised; however, the problem is that even for extremely small text files (under 15 KB) it takes a very long time to run.

I would really appreciate it if someone could suggest a script which achieves exactly the same outcome but does so far more efficiently.

Thank you in advance for any help,

Kind regards, John


Solution

  • I'd be very wary of doing this in pure shell. Each iteration of the loop forks a fresh hashing process (plus cut and a command-substitution subshell), and that per-line startup overhead is what makes it so slow, especially on a large file.

    How about a short bit of Perl?

    perl -MDigest::MD5 -nle 'print Digest::MD5::md5_hex($_), ":", $_' <$originalfile >>$outputlocation
    
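    For example, given a hypothetical input file whose only line is hello (a made-up file, not one taken from the question), that one-liner would print:

    5d41402abc4b2a1e4649e069f88df2e9:hello

    Because -n loops over the input and -l chomps each line's trailing newline before it is hashed, a single perl process handles the whole file.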

    Perl has a variety of Digest modules, so it is easy to use something less broken than MD5.

    perl -MDigest::SHA -nle 'print Digest::SHA::sha256_hex($_), ":", $_' <$originalfile >>$outputlocation
    

    If you want to use Whirlpool, you can install it from CPAN with

    cpan install Digest::Whirlpool
    

    and use it with

    perl -MDigest -nle '$ctx = Digest->new("Whirlpool"); $ctx->add($_); print $ctx->hexdigest(), ":", $_' <$originalfile >>$outputlocation
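
    Putting it together, here is a minimal sketch of how one of these one-liners might replace the original loop; the file names below are placeholders standing in for whatever $originalfile and $outputlocation pointed at:

    #!/usr/bin/env bash
    # Hypothetical paths; substitute your own.
    originalfile="words.txt"
    outputlocation="hashes.txt"

    # A single perl process hashes every line, instead of one hash process per line.
    perl -MDigest::SHA -nle 'print Digest::SHA::sha256_hex($_), ":", $_' \
        <"$originalfile" >>"$outputlocation"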