I am looking to compress a lot of data spread across loads of sub-directories into an archive. I cannot simply use built-in tar functions because I need my Perl script to work in a Windows as well as a Linux environment. I have found the Archive::Tar module, but its documentation gives a warning:
Note that this method [create_archive()] does not write on the fly as it were; it still reads all the files into memory before writing out the archive. Consult the FAQ below if this is a problem.
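For context, the one-shot call that warning refers to looks roughly like this (a minimal sketch; I am assuming a reasonably recent Archive::Tar that exports the COMPRESS_BZIP constant, and the file names are placeholders):

use strict;
use warnings;
use Archive::Tar qw(COMPRESS_BZIP);

# Every listed file is slurped into memory before the archive is
# written out, which is exactly the behaviour the warning describes.
Archive::Tar->create_archive(
    'archive.tar.bz2',
    COMPRESS_BZIP,
    'data/one.xml',
    'data/two.xml',
);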
Because of the sheer size of my data, I want to write 'on the fly'. But I cannot find useful information in the FAQ about writing files. They suggest using the iterator iter():
Returns an iterator function that reads the tar file without loading it all in memory. Each time the function is called it will return the next file in the tarball.
my $next = Archive::Tar->iter( "example.tar.gz", 1, {filter => qr/\.pm$/} );

while( my $f = $next->() ) {
    print $f->name, "\n";

    $f->extract or warn "Extraction failed";

    # ....
}
But this only discusses the reading of files, not the writing of the compressed archive. So my question is, how can I take a directory $dir and recursively add it to an archive archive.tar.bz2 with bzip2 compression in a memory-friendly manner, i.e. without first loading the whole tree into memory?
I tried to build my own script with the suggestions in the comments, using Archive::Tar::Streamed and IO::Compress::Bzip2, but to no avail.
use strict;
use warnings;
use Archive::Tar::Streamed;
use File::Spec;
use IO::Compress::Bzip2 qw(bzip2 $Bzip2Error);

my ($in_d, $out_tar, $out_bz2) = @ARGV;

# Write the uncompressed tar stream to disk first.
open(my $out_fh, '>', $out_tar) or die "Couldn't create archive: $!";
binmode $out_fh;

my $tar = Archive::Tar::Streamed->new($out_fh);

opendir(my $in_dh, $in_d) or die "Could not opendir '$in_d': $!";
while (my $in_f = readdir $in_dh) {
    next unless $in_f =~ /\.xml$/;

    print STDOUT "Processing $in_f\r";

    $in_f = File::Spec->catfile($in_d, $in_f);
    $tar->add($in_f);
}

# Compress the finished tar file to bzip2 in a second pass.
print STDOUT "\nBzip'ing $out_tar\r";
bzip2 $out_tar => $out_bz2
    or die "Bzip2 failed: $Bzip2Error\n";
Very quickly, my system runs out of memory. I have 32 GB available on my current system, but it gets flooded almost immediately. Some files in the directory I am trying to add to the archive exceed 32 GB.

So I wonder: even in the Streamed class, does each file have to be read completely into memory before being added to the archive? I assumed the files themselves would be streamed to the archive in buffers, but perhaps Streamed simply avoids holding ALL the files in memory at once, while still needing one complete file in memory at a time as it adds them one by one?
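To make the distinction clear, by 'streamed in buffers' I mean a fixed-size copy loop along these lines (a hypothetical sketch of the idea only, not what Archive::Tar::Streamed actually does):

use strict;
use warnings;

# Copy a file onto an already-open handle in 1 MiB chunks, so that only
# one buffer is ever held in memory, no matter how large the file is.
sub copy_in_chunks {
    my ($src_path, $dst_fh) = @_;
    open(my $src, '<:raw', $src_path) or die "Cannot open '$src_path': $!";
    while (read($src, my $buf, 1024 * 1024)) {
        print {$dst_fh} $buf or die "Write failed: $!";
    }
    close $src;
}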
Unfortunately, what you want is not possible in Perl:
I agree, it would be nice if this module could write the files in chunks and then rewrite the headers afterwards (to maintain the relationship of Archive::Tar doing the writing). You could maybe walk the archive backwards knowing you split the file into N entries, remove the extra headers, and update the first header with the sum of their sizes.

At the moment the only options are: use Archive::Tar::File, split the data into manageable sizes outside of perl, or use the tar command directly (to use it from perl, there's a nice wrapper on CPAN: Archive::Tar::Wrapper).

I don't think we'll ever have a truly non-memory-resident tar implementation in Perl based on Archive::Tar. To be honest, Archive::Tar itself needs to be rewritten or succeeded by something else.
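If shelling out is acceptable, a minimal sketch of that last option could look like the following. It assumes a tar that understands -j for bzip2 (GNU tar or bsdtar) is available on the PATH, which on Windows you would have to install yourself:

use strict;
use warnings;

my ($dir, $archive) = @ARGV;

# The external tar reads each file and pipes it through bzip2 itself,
# so the Perl process never holds file contents in memory.
my @cmd = ('tar', '-cjf', $archive, '-C', $dir, '.');
system(@cmd) == 0
    or die "tar failed (exit status $?)\n";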