perl math entropy

Calculating the entropy of a 32 MB file in Perl - What is the quickest method?


I have a 32,678 KB encrypted binary file whose entropy I need to calculate. I am using Perl because it is part of a larger project.

So far I have used the following approach:

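use strict;
use warnings;
use Shannon::Entropy qw/entropy/;

my $file = "test.bin";
open(my $bin, "<", $file) or die "Can't open $file: $!";
binmode $bin;
# Read 0x01FFFFF0 bytes (just under 32 MB) from the start of the file into memory
seek($bin, 0x000000, 0);
read($bin, my $data, 0x01FFFFF0);
print entropy($data);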

This takes so long that I give up after waiting more than 30 minutes.

I need the entropy of the entire file, so testing only part of it is not an option.

Is there a quicker way? If I split the file, calculated the entropy of each piece, and combined the results mathematically, would I get the same entropy as for the whole file?


Solution

  • Here is the entropy function rewritten to avoid all the map calls:

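    sub entropy {
        my ($entropy, $len, $p, %t) = (0, length($_[0]));

        # Count how often each character occurs
        $t{$_}++ foreach split '', $_[0];

        # Sum -p * ln(p) over the distinct characters
        foreach (values %t) {
            $p = $_ / $len;
            $entropy -= $p * log $p;
        }

        # Dividing by ln(2) converts the result to bits
        return $entropy / log 2;
    }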
    

    It may work out faster for you.

    On second thought, you don't actually need to slurp the file into memory. $len is the length of the file, which you can get from -s $file_name, and %t is the frequency table, which can be built by reading the file one block at a time. The counts from each block simply add up, so the result is the same as for the whole file read at once. So a version of the function that calculates the entropy of a file would be:

    sub file_entropy {
        my ($file_name) = @_;

        # Get number of bytes in file
        my $len = -s $file_name;
        my ($entropy, %t) = (0);

        open(my $file, '<:raw', $file_name) or die "Can't open $file_name: $!\n";

        # Read the file 1024 bytes at a time to build the frequency table
        while (read($file, my $buffer, 1024)) {
            $t{$_}++ foreach split '', $buffer;
        }

        # Sum -p * ln(p) over the distinct byte values
        foreach (values %t) {
            my $p = $_ / $len;
            $entropy -= $p * log $p;
        }

        # Dividing by ln(2) converts the result to bits per byte
        return $entropy / log 2;
    }
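
The block-at-a-time idea also bears on the question about splitting the file. The entropies of separate pieces cannot simply be averaged, but their byte counts can be added together, and the entropy of the combined counts equals the entropy of the whole file. The sketch below separates those two steps. The names add_byte_counts, entropy_from_counts and file_entropy_fast are hypothetical, and the 64 KB block size is an assumption rather than a tuned value. It counts bytes with unpack into a 256-element array instead of split and a hash, which may be faster but has not been benchmarked here.

```perl
use strict;
use warnings;

# Hypothetical helper: add the byte counts of one chunk to a running
# 256-element histogram (one slot per possible byte value).
sub add_byte_counts {
    my ($counts, $chunk) = @_;
    $counts->[$_]++ for unpack 'C*', $chunk;
    return;
}

# Hypothetical helper: Shannon entropy in bits per byte from a histogram.
sub entropy_from_counts {
    my ($counts) = @_;

    my $total = 0;
    $total += $_ for @$counts;
    return 0 unless $total;        # empty input: define entropy as 0

    my $entropy = 0;
    for my $n (@$counts) {
        next unless $n;            # skip byte values that never occur
        my $p = $n / $total;
        $entropy -= $p * log $p;
    }
    return $entropy / log 2;       # convert from nats to bits
}

# Hypothetical wrapper: entropy of a whole file, read in 64 KB blocks
# (the block size is an assumption, not a tuned value).
sub file_entropy_fast {
    my ($file_name) = @_;
    open(my $fh, '<:raw', $file_name) or die "Can't open $file_name: $!\n";

    my @counts = (0) x 256;
    while (read($fh, my $buffer, 65536)) {
        add_byte_counts(\@counts, $buffer);
    }
    return entropy_from_counts(\@counts);
}
```

It could be called as printf "%.4f\n", file_entropy_fast("test.bin"); the result is in bits per byte, so it lies between 0 and 8.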