Tags: performance, perl, memory-footprint

Speed and memory management of C vs Perl


Context: Currently, my team has a Perl script that does many things, one of which is storing 1-byte hex values in a hashmap (an array yields similar results). The input data can range from hundreds of MB to tens of GB. At the moment, when we run a 1 GB input (1 billion entries), the script takes about 10 minutes and then errors out after using all 16 GB of my RAM. I'm told a 1 GB input can expand to nearly 300 GB.

We then wrote a comparable C program and found that it takes a few minutes and uses only 1.1 GB.

I wrote the code below simply to test how C and Perl perform when writing 1 billion values. I'm finding that the Perl code takes around 186 seconds and over 70 GB of memory to run, while the C code takes only two seconds and 1 GB. I used time and memusage to take the measurements.

Question: Is Perl actually this slow and this bad at memory management, or am I missing something? The literature I've read online says Perl should be slower because of the flexibility it provides, but not dramatically so, since it is written in C.

Perl code example of memory usage:

use strict;
use warnings;

my @list;

for (my $a = 0; $a < 1000000000; $a++) {
    $list[$a] = 1; # 1 is just to simulate some data.
}
print 'done';

C code:

#include <stdio.h>
#include <stdlib.h>

int main() {

    int size = 1000000000;
    unsigned char *data = malloc(size * sizeof(unsigned char));

    unsigned char byte = 'a';
    int address = 0;
    while (address < size) {
        data[address] = byte;
        address++;
    }
    printf("done %i.\n", address);

    free(data);
    return 0;
}


I also tried it in Python, which was worse than Perl in terms of speed.

data = []

d = format(231, '#04x')

address = 0
while address < 1000000000:
    data.append(d)
    address += 1
print "done"
# busy-loop so the process stays alive while memory usage is inspected
while(1):
    continue

Note: I haven't used a profiler yet, since the evaluation code is simple.

Because of these performance issues, I found a possible solution called SWIG, which allows me to wrap C code and call it from Perl; however, I have some follow-up questions about it. :)



Solution

  • An array (scalar of type SVt_PVAV) takes 64 bytes on my system.

    $ perl -Mv5.10 -MDevel::Size=size -e'my @a; say size( \@a );'
    64
    

    This includes the fields common to all variables (refcount, variable type, flags, pointer to the body), plus the fields specific to SVt_PVAV (total size, size used, pointer to the underlying array of pointers).

    This doesn't include the actual pointers to the scalars it contains.
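
    Devel::Size's total_size, by contrast, does descend into the contents. Here is a quick sanity check on a small array; the exact numbers will vary with the Perl build and with over-allocation:

    $ perl -Mv5.10 -MDevel::Size=total_size -e'my @a = (1) x 10; say total_size( \@a );'

    On a 64-bit build this should come out near 64 + 10 * ( 8 + 24 ) = 384 bytes, plus whatever the array over-allocated.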


    The size of a scalar that can only contain an integer (SVt_IV) is 24 bytes on my system.

    $ perl -Mv5.10 -MDevel::Size=size -e'my $i = 1; say size( $i );'
    24
    

    This includes the fields common to all variables (refcount, variable type, flags, pointer to the body), plus the fields specific to SVt_IV (the integer).


    So we're talking about 64 + 1,000,000,000 * ( 8 + 24 ) = 32e9 bytes. Plus intentional over-allocation of the array (to avoid having to realloc each time you add an element). Plus the overhead of 1,000,000,003 memory blocks. It's not unimaginable that this would take a total of 70e9 bytes.
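
    You can sanity-check the per-element figure without building the full billion-element array; this sketch measures a million elements and divides (again, the exact value depends on the build):

    $ perl -Mv5.10 -MDevel::Size=total_size -e'my @a = (1) x 1_000_000; say total_size( \@a ) / 1_000_000;'

    Expect something just over 32 bytes per element from this measurement; the rest of the way to the observed 70e9 bytes comes from the over-allocation and per-block overhead mentioned above, which Devel::Size doesn't count.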

    As for the speed, all these allocations add up. And of course, you're doing arithmetic on a scalar, not an int. This involves pointers, type checks and flag checks every single time you increment it.
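
    If you want to see that bookkeeping, the core Devel::Peek module dumps the type, flags and reference count the interpreter has to consult on every access (addresses and exact flag sets will differ between builds):

    $ perl -MDevel::Peek -e'my $i = 1; $i++; Dump( $i );'

    None of that exists for a plain C int, which the CPU can increment directly.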

    There is a price to the convenience of variables which can hold data of any type, arrays that can be expanded at will, and automatic memory deallocation. But the benefits are also immense.