
Can you stream file-per-file, line-per-line from a .tar.bz2 archive in Perl?


We have a lot of compressed data in the form of compressed tape archives (tar) of directories and their subdirectories, which contain XML files; e.g.

omega/    
- alpha/
  - a/
    - file1.xml
    - file2.xml
    - file3.xml
  - b/
    - file1.xml
    - file2.xml
    - file3.xml
  - c/
    - ...
- beta/
  - a/
    - file1.xml
    - file2.xml
    - file3.xml
  - b/
    - ...
  - c/
    - ...
- gamma/
  - a/
    - ...
  - b/
    - ...
  - c/
    - ...

The result is files such as omega.tar.bz2, and these files can reach sizes of hundreds of gigabytes.

Even though we are aware that this is an archive file type, it would be nice to still be able to use its contents when we need to. Therefore I was wondering if it is possible to read from these files in Perl in a streaming manner, i.e. without first having to unpack and decompress everything on disk or without having to load the whole *.tar.bz2 file into memory.

I know that with IO::Uncompress you can bunzip2, but as far as I can see and have tested, this reads the whole file into memory, which is not possible with our large files. Example code for bunzipping is below (not including TAR).

use strict;
use warnings;
use IO::Uncompress::Bunzip2 qw(bunzip2 $Bunzip2Error);

my $filename = '/path/to/file/file1.xml.bz2';

# The functional interface decompresses the entire file into $buffer at once,
# so memory use grows with the uncompressed size.
my $buffer;
bunzip2 $filename => \$buffer
  or die "bunzip2 failed: $Bunzip2Error\n";

print STDOUT "$buffer\n";

Taking the TAR into account, there is also the Archive::Extract module, which can read a .tar.bz2 file (type tbz) into an Extract object, but again this would read the whole file into memory, which is not possible with our ginormous files.

Based on my own research into the topic, I think it is unlikely that a bzip2-compressed TAR can be read in a streaming fashion, i.e. line per line. I have no experience with compression, though, so maybe there is a way to reconstruct file lines given a number of data blocks.

TL;DR: can you stream the file contents (line-per-line or similar) from a bzip2-compressed TAR archive?


Solution

  • There is Compress::Raw::Bzip2, which allows you to decompress bzip2 input chunk by chunk, i.e. as a stream. But since a .tar.bz2 is first a tar file which is then compressed with bzip2, you need to decompress all data up to the file's location in the tar file before you have access to the data you want, i.e. there is no way to seek to a file without decompressing everything up to that file. If you are fine with this, you might be able to use Archive::Tar::Stream, i.e. feed the output of your bzip2 decoder into the streaming tar parser. I've never used it myself, but it looks like it was developed for exactly this kind of use case.

    If you have the option to change the format of the input files, I would recommend using a format which stores individually compressed files in the archive (like ZIP does) instead of compressing the full archive (as .tar.bz2 does). This way you could easily seek to a specific compressed file and decompress only that file instead of everything up to it.
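
    The sequential approach described above can be sketched roughly as follows. This is a minimal sketch, not a tested production reader. It does not use Archive::Tar::Stream. Instead it uses the object interface of IO::Uncompress::Bunzip2, which decompresses incrementally, and parses the 512-byte tar headers by hand. The names stream_tar_bz2 and read_exact are hypothetical helpers made up for this illustration. The sketch assumes plain ustar/GNU headers with octal sizes; it skips GNU long-name and pax extension entries, so very long paths would come out truncated.

```perl
use strict;
use warnings;
use IO::Uncompress::Bunzip2 qw($Bunzip2Error);

# Hypothetical helper: read exactly $len decompressed bytes (fewer only at EOF).
sub read_exact {
    my ($z, $len) = @_;
    my $data = '';
    while (length($data) < $len) {
        my $n = $z->read(my $chunk, $len - length($data));
        die "read failed: $Bunzip2Error\n" if !defined $n || $n < 0;
        last if $n == 0;    # end of compressed stream
        $data .= $chunk;
    }
    return $data;
}

# Hypothetical helper: call $on_line->($path, $line) for every line of every
# regular file in the archive, holding only one chunk in memory at a time.
sub stream_tar_bz2 {
    my ($archive, $on_line) = @_;

    # MultiStream also handles archives made of concatenated bzip2 streams.
    my $z = IO::Uncompress::Bunzip2->new($archive, MultiStream => 1)
        or die "cannot open $archive: $Bunzip2Error\n";

    while (1) {
        my $header = read_exact($z, 512);
        # A short read or an all-zero block marks the end of the archive.
        last if length($header) < 512 || $header !~ /[^\0]/;

        # Fields: name(100) size(12, octal) typeflag(1) magic(6) prefix(155).
        my ($name, $size_field, $type, $magic, $prefix) =
            unpack('Z100 x24 a12 x20 a1 x100 a6 x82 Z155', $header);
        my $size = $size_field =~ /([0-7]+)/ ? oct($1) : 0;

        # Only POSIX ustar headers carry a path prefix in this field.
        my $path = ($magic eq "ustar\0" && length $prefix) ? "$prefix/$name" : $name;
        my $is_file = $type eq '0' || $type eq "\0";

        my ($remaining, $pending) = ($size, '');
        while ($remaining > 0) {
            my $want  = $remaining < 65536 ? $remaining : 65536;
            my $chunk = read_exact($z, $want);
            die "truncated archive\n" if length($chunk) < $want;
            $remaining -= $want;
            next unless $is_file;    # payload of other entry types is skipped

            $pending .= $chunk;
            # Emit every complete line; keep the unfinished tail for later.
            while ($pending =~ s/\A(.*?\n)//s) { $on_line->($path, $1) }
        }
        $on_line->($path, $pending) if $is_file && length $pending;

        # File data is padded with NUL bytes up to the next 512-byte boundary.
        read_exact($z, (512 - $size % 512) % 512);
    }
    $z->close;
    return;
}

# Usage: perl stream.pl omega.tar.bz2
stream_tar_bz2($ARGV[0], sub { my ($path, $line) = @_; print "$path: $line" })
    if @ARGV;
```

    Memory use is bounded by the chunk size plus the longest single line, so an XML file without any newlines would still be buffered in full. Reaching a file near the end of the archive still means decompressing everything before it, as explained above.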