Perl: Split a mixed text and binary file after a specific string


I have files that start with Unix-style (LF-terminated) text lines and then switch to binary. The text portion ends with a specific string followed by a newline; everything after that is binary.

I need to write the text portion into one file, then write the remainder of the binary data into another file. Here's what I have so far, but I'm stuck on how to switch to binary and write the remainder.

#!/usr/bin/perl

use 5.010;
use strict; 
use warnings;


my ($inputfilename, $outtextfilename, $outbinfilename) = @ARGV;
open(my $in, '<:encoding(UTF-8)', $inputfilename)
  or die "Could not open file '$inputfilename' $!";

open my $outtext, '>', $outtextfilename or die;

my $outbin;
open $outbin, '>', $outbinfilename or die;
binmode $outbin;


while (my $aline = <$in>) {
  chomp $aline;
  if($aline =~ /\<\/FileSystem\>/) {   # a match indicates the end of the text portion - the rest is binary
    print $outtext "$aline\n";  # last line of the text portion
    print  "$aline\n";  # last line of the text portion
    close ($outtext); 

    binmode $in;  # change input file to binary? 
    # what do I do here to copy all remaining bytes in file as binary to $outbin??
    die;
    } else {
    print $outtext  "$aline\n";   # a line of the text portion
    print "$aline\n";   # a line of the text portion
    }
}

close ($in);
close ($outbin); 

Edit - final code:

#!/usr/bin/perl
use 5.010;
use strict; 
use warnings;


my ($inputfilename, $outtextfilename, $outbinfilename) = @ARGV;

open(my $in, '<', $inputfilename)
  or die "Could not open file '$inputfilename' $!";

open my $outtext, '>', $outtextfilename or die;

my $outbin;
open $outbin, '>', $outbinfilename or die;
binmode $outbin;


print "Starting File\n";
while (my $aline = <$in>) {
  chomp $aline;
  if($aline =~ /\<\/FileSystem\>/) {   # a match indicates the end of the text portion - the rest is binary
    print $outtext "$aline\n";  # last line of the text portion
    print  "$aline\n";  # last line of the text portion
    close ($outtext); 

    binmode $in;  # change input file to binary

    my $cont = '';
    print "processing binary portion";
    while (1) {
      my $success = read $in, $cont, 1000000, length($cont);
      die $! if not defined $success;
      last if not $success;
      print ".";
    }
    close ($in);
    print $outbin $cont;
    print "\nDone\n";
    close $outbin;
    last;

    } else {
    print $outtext  "$aline\n";   # a line of the text portion
    print "$aline\n";   # a line of the text portion
    }
}

Solution

  • The easiest way is probably to use binary I/O for everything. That way we don't have to worry about switching file modes halfway through, and on Unix there is no difference between text and binary mode anyway (except for encoding layers, which we don't want here because we just want to copy bytes unchanged).
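To illustrate why an encoding layer gets in the way, here is a minimal sketch that reads the same two bytes once through `:raw` and once through `:encoding(UTF-8)`. The sample bytes and the in-memory filehandles are assumptions made for the demonstration; they are not part of the original script.

```perl
use strict;
use warnings;

# Two bytes that happen to be the UTF-8 encoding of one character (U+00E9).
my $bytes = "\xC3\xA9";

# Read through the :raw layer: the bytes come back unchanged.
open my $raw_fh, '<:raw', \$bytes
    or die "can't open in-memory handle: $!\n";
my $raw = do { local $/; readline $raw_fh };

# Read through the UTF-8 decoding layer: the two bytes become one character.
open my $utf8_fh, '<:encoding(UTF-8)', \$bytes
    or die "can't open in-memory handle: $!\n";
my $decoded = do { local $/; readline $utf8_fh };

printf "raw: %d byte(s), decoded: %d character(s)\n",
    length($raw), length($decoded);
```

With a decoding layer the data we read is no longer the data on disk, which is exactly what we want to avoid for the binary part.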

    Depending on how big the plain text portion of the file is, we could either process it line by line or read it all into memory at once.

    #!/usr/bin/perl
    use strict; 
    use warnings;
    
    my ($inputfilename, $outtextfilename, $outbinfilename) = @ARGV;
    
    open my $in_fh, '<:raw', $inputfilename
        or die "$0: can't open $inputfilename for reading: $!\n";
    
    open my $out_txt_fh, '>:raw', $outtextfilename
        or die "$0: can't open $outtextfilename for writing: $!\n";
    
    open my $out_bin_fh, '>:raw', $outbinfilename
        or die "$0: can't open $outbinfilename for writing: $!\n";
    
    # process text part
    while (my $line = readline $in_fh) {
        print $out_txt_fh $line;
        last if $line =~ m{</FileSystem>};
    }
    
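    # process binary part
    while (1) {
        my $count = read $in_fh, my $buffer, 4096;
        die "$0: error reading $inputfilename: $!\n" if !defined $count;
        last if $count == 0;    # end of file
        print $out_bin_fh $buffer;
    }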
    

    This version of the code processes the text part line by line and the binary part in chunks of 4096 bytes (not taking internal buffering into account).
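As a side note, the chunked copy can also be written with `readline` by setting `$/` to a reference to an integer, which makes Perl return fixed-size records instead of lines. This is a different technique from the `read` loop, shown here only as a sketch; the sample data and the in-memory filehandles are made up for the demonstration.

```perl
use strict;
use warnings;

# Made-up sample data: 10,000 bytes cycling through all byte values.
my $data = join '', map { chr($_ % 256) } 0 .. 9_999;

open my $in_fh, '<', \$data
    or die "can't open in-memory input: $!\n";
open my $out_fh, '>', \my $copy
    or die "can't open in-memory output: $!\n";

my $chunks = 0;
{
    local $/ = \4096;   # readline now returns records of at most 4096 bytes
    while (defined(my $chunk = readline $in_fh)) {
        print {$out_fh} $chunk;
        $chunks++;
    }
}
close $out_fh or die "can't close in-memory output: $!\n";

print "copied ", length($copy), " bytes in $chunks chunk(s)\n";
```

Whether this reads better than an explicit `read` loop is a matter of taste; the `read` loop makes error handling more visible.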

    Alternatively, if the character sequence marking the end of the text part is exactly "</FileSystem>\n", we can be a bit cheeky:

    # process text part
    {
        local $/ = "</FileSystem>\n";
        if (my $line = readline $in_fh) {
            print $out_txt_fh $line;
        }
    }
    

    We temporarily switch the end-of-line marker from "\n" to "</FileSystem>\n" and read a single "line", which encompasses all of the text part. This assumes the text part is small enough to comfortably fit into memory. The rest of the script is the same.
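One thing to keep in mind with the `$/` trick: if the marker never appears, `readline` returns the whole file and the binary output stays empty. The sketch below wraps both steps in a hypothetical helper, `split_at_marker` (a name invented here, not part of the script above), that also reports whether the marker was actually found. It uses in-memory filehandles and made-up sample data so it can be run on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copies everything up to and including $marker to
# $txt_fh and all remaining bytes to $bin_fh.
# Returns 1 if the marker was found, 0 otherwise.
sub split_at_marker {
    my ($in_fh, $txt_fh, $bin_fh, $marker) = @_;

    my $found = 0;
    {
        local $/ = $marker;    # one "line" now ends at the marker
        my $text = readline $in_fh;
        if (defined $text) {
            print {$txt_fh} $text;
            $found = $text =~ /\Q$marker\E\z/ ? 1 : 0;
        }
    }

    # copy whatever is left in fixed-size chunks
    while (1) {
        my $count = read $in_fh, my $buffer, 4096;
        die "read error: $!\n" if !defined $count;
        last if $count == 0;   # end of input
        print {$bin_fh} $buffer;
    }
    return $found;
}

# Demonstration with made-up sample data.
my $marker = "</FileSystem>\n";
my $binary = join '', map { chr } 0 .. 255;
my $input  = "<FileSystem>\nname=demo\n" . $marker . $binary;

open my $in_fh, '<', \$input
    or die "can't open in-memory input: $!\n";
open my $txt_fh, '>', \my $text_out
    or die "can't open in-memory text output: $!\n";
open my $bin_fh, '>', \my $bin_out
    or die "can't open in-memory binary output: $!\n";

my $found = split_at_marker($in_fh, $txt_fh, $bin_fh, $marker);
close $txt_fh or die "can't close text output: $!\n";
close $bin_fh or die "can't close binary output: $!\n";

print $found ? "marker found\n" : "marker missing\n";
printf "text bytes: %d, binary bytes: %d\n",
    length($text_out), length($bin_out);
```

With real files, we would open the three handles with `'<:raw'` and `'>:raw'` as in the script above and pass them to the helper unchanged.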