Tags: perl, archive, tar, bzip2, archive-tar

How to add a complete tree structure to a .tar.bz2 file with Perl?


I am looking to compress a lot of data, spread across loads of sub-directories, into an archive. I cannot simply use built-in tar functions because I need my Perl script to work on Windows as well as Linux. I have found the Archive::Tar module, but its documentation gives a warning:

Note that this method [create_archive()] does not write on the fly as it were; it still reads all the files into memory before writing out the archive. Consult the FAQ below if this is a problem.

Because of the sheer size of my data, I want to write 'on the fly'. But I cannot find any useful information in the FAQ about writing files. It suggests using the iterator iter():

Returns an iterator function that reads the tar file without loading it all in memory. Each time the function is called it will return the next file in the tarball.

my $next = Archive::Tar->iter( "example.tar.gz", 1, {filter => qr/\.pm$/} );
while( my $f = $next->() ) {
    print $f->name, "\n";
    $f->extract or warn "Extraction failed";
    # ....
}

But this only discusses reading files, not writing a compressed archive. So my question is: how can I take a directory $dir and recursively add it to an archive archive.tar.bz2 with bzip2 compression, in a memory-friendly manner, i.e. without first loading the whole tree into memory?

I tried to build my own script from the suggestions in the comments, using Archive::Tar::Streamed and IO::Compress::Bzip2, but to no avail.

use strict;
use warnings;

use Archive::Tar::Streamed;
use File::Spec;   # catfile is called as a class method below, not imported
use IO::Compress::Bzip2 qw(bzip2 $Bzip2Error);

my ($in_d, $out_tar, $out_bz2) = @ARGV;

open(my $out_fh, '>', $out_tar) or die "Couldn't create archive '$out_tar': $!";
binmode $out_fh;

my $tar = Archive::Tar::Streamed->new($out_fh);

opendir(my $in_dh, $in_d) or die "Could not opendir '$in_d': $!";
while (my $in_f = readdir $in_dh) {
  next unless ($in_f =~ /\.xml$/);
  print STDOUT "Processing $in_f\r";
  $in_f = File::Spec->catfile($in_d, $in_f);
  $tar->add($in_f);
}
closedir $in_dh;

print STDOUT "\nBzip'ing $out_tar\r";

bzip2 $out_tar => $out_bz2
    or die "Bzip2 failed: $Bzip2Error\n";

Very quickly, my system runs out of memory. I have 32GB available on my current machine, but it gets flooded almost immediately. Some of the files in the directory I am trying to add to the archive are themselves larger than 32GB.

[screenshot: memory exceeded]

So I wonder: even with the Streamed class, does each file have to be read completely into memory before it is added to the archive? I assumed the files themselves would be streamed to the archive in buffers, but perhaps Streamed merely avoids holding ALL files in memory at once and instead needs only one complete file in memory at a time, adding them to the archive one by one?


Solution

  • Unfortunately, what you want is not possible in Perl:

    I agree, it would be nice if this module could write the files in chunks and then rewrite the headers afterwards (to maintain the relationship of Archive::Tar doing the writing). You could maybe walk the archive backwards knowing you split the file into N entries, remove the extra headers, and update the first header with the sum of their sizes.

    At the moment the only options are: use Archive::Tar::File, split the data into manageable sizes outside of Perl, or use the tar command directly (to use it from Perl, there's a nice wrapper on CPAN: Archive::Tar::Wrapper; a sketch of that route follows below).

    I don't think we'll ever have a truly non-memory-resident tar implementation in Perl based on Archive::Tar. To be honest, Archive::Tar itself needs to be rewritten or succeeded by something else.
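
For the last option, here is a minimal sketch (not part of the original answer) of what the Archive::Tar::Wrapper route could look like. It reuses the question's command-line arguments and its IO::Compress::Bzip2 step for the bzip2 compression, uses File::Find for the recursive walk, and assumes a working tar binary on the PATH, which Archive::Tar::Wrapper shells out to:

use strict;
use warnings;

use Archive::Tar::Wrapper;
use File::Find;
use File::Spec;
use IO::Compress::Bzip2 qw(bzip2 $Bzip2Error);

my ($in_d, $out_tar, $out_bz2) = @ARGV;

# Archive::Tar::Wrapper stages entries on disk and delegates the actual
# packing to the system tar binary, so file contents never have to be
# held in Perl's memory.
my $arch = Archive::Tar::Wrapper->new();

find({
    no_chdir => 1,
    wanted   => sub {
        return unless -f $File::Find::name;
        # store paths relative to the top directory inside the archive
        my $logic = File::Spec->abs2rel($File::Find::name, $in_d);
        $arch->add($logic, $File::Find::name);
    },
}, $in_d);

# write an uncompressed tar first ...
$arch->write($out_tar);

# ... then bzip2 it file-to-file; IO::Compress::Bzip2's one-shot interface
# copies in chunks, so this step is memory-friendly as well
bzip2 $out_tar => $out_bz2
    or die "Bzip2 failed: $Bzip2Error\n";

Note that Archive::Tar::Wrapper copies files into a temporary staging directory before invoking tar, so the trade-off is extra disk space rather than memory.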