macos perl newline large-files chomp

Perl large-file IO bug on Mac but not Windows or Linux (adds a newline that can't be chomped)


I've tested my program on a dozen Windows machines, half a dozen Macs, and a Linux machine. It works without error on Windows and Linux but not on the Macs. My program is designed to work with protein database files, which are text files that range from 250MB to 10GB. I took one tenth of the 250MB file to make a sample file for debugging, but the error did not occur with the smaller file.

I've narrowed the bug down to this section of code, where $tempFile is the protein database file:

open(ps_file, "..".$slash."dataset".$slash.$tempFile) 
         or die "couldn't open $tempFile";
while(<ps_file>){
    chomp;


    my @curLine = split(/\t/, $_);
    my $filter = 1;
    if($taxon){
        chomp($curLine[2]);

        print "line2 ".$curLine[2].",\t".$taxR{$curLine[2]}."\n";

        $filter = $taxR{$curLine[2]};
    }
    if($filter){
        checkSeq(@curLine);
    }
}

This is a screenshot of the output of that print statement showing special characters:

[Screenshot: output of that print statement showing special characters]

This is what the output looks like on a Windows Machine:

[Screenshot: output on a Windows machine]

Here is an example of one line from $tempFile:

>sp|P48255|ABCX_CYAPA Probable ATP-dependent transporter ycf16 OS=Cyanophora paradoxa GN=ycf16 PE=3 SV=1 MSTEKTKILEVKNLKAQVDGTEILKGVNLTINSGEIHAIMGPNGSGKSTFSKILAGHPAYQVTGGEILFKNKNLLELEPEERARAGVFLAFQYPIEIAGVSNIDFLRLAYNNRRKEEGLTELDPLTFYSIVKEKLNVVKMDPHFLNRNVNEGFSGGEKKRNEILQMALLNPSLAILDETDSGLDIDALRIVAEGVNQLSNKENSIILITHYQRLLDYIVPDYIHVMQNGRILKTGGAELAKELEIKGYDWLNELEMVKK CYAPA


Solution

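  • The problem probably lies in inconsistent line endings. chomp removes only the input record separator ($/, "\n" by default), so if a line ends in "\r\n" and nothing translates it on the way in, a stray "\r" survives the chomp. If, as I suspect, trailing whitespace is not significant, you're better off removing that instead of chomping.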
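
    As a minimal sketch of the difference, the following compares chomp with stripping all trailing whitespace. The sample line and its "\r\n" ending are assumptions made for illustration; they are not taken from your file.

    ```perl
    use strict;
    use warnings;

    # Hypothetical tab-separated line with a CRLF ending, as it would
    # arrive if no layer translated "\r\n" to "\n" while reading.
    my $line = "P48255\tABCX_CYAPA\tCYAPA\r\n";

    # chomp removes only the input record separator ("\n" by default).
    my $chomped = $line;
    chomp $chomped;

    # The substitution removes every trailing whitespace character,
    # including "\r".
    my $stripped = $line;
    $stripped =~ s/\s+\z//;

    printf "after chomp, CR still present: %s\n",
        $chomped =~ /\r\z/ ? 'yes' : 'no';
    printf "after s/\\s+\\z//, CR still present: %s\n",
        $stripped =~ /\r\z/ ? 'yes' : 'no';
    ```

    If the last field of a line ends in "\r", it will no longer match keys such as those in %taxR, and printing it moves the cursor back to the start of the line, which may be what the garbled output shows.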

    Also note:

    • Bareword filehandles such as ps_file are package global variables that are subject to action at a distance; use lexical filehandles instead.

    • Use File::Spec or Path::Class to handle file paths in a platform independent way.

    • Include the full file path and the error message ($!) if there is an error opening a file.

    • In

      chomp;
      
      my @curLine = split(/\t/, $_);
      my $filter = 1;
      if($taxon){
          chomp($curLine[2]);
      

    $curLine[2] comes from a string that was read in as a line and chomped. I don't see why you are chomping that again.

    Here's a tidied-up version of your code snippet:

    use File::Spec::Functions qw( catfile );
    
    my $input_file = catfile('..', dataset => $tempFile);
    
    
    open my $ps_file, '<', $input_file
        or die "couldn't open '$input_file': $!";
    
    while (my $line = <$ps_file>) {
        $line =~ s/\s+\z//; # remove all trailing space
    
        my @curLine = split /\t/, $line;
    
        my $filter = 1;
        if ($taxon) {
            my $field = $curLine[2];
            $filter = $taxR{ $field };
    
            print join("\t", "line2 $field", $filter), "\n";
        }
        if ($filter) {
            checkSeq(@curLine);
        }
    }
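
    To confirm the line-ending suspicion on the Mac, you could inspect the start of the file in raw mode. This is only a sketch: count_line_endings and sample_file are hypothetical helper names, and the path and the 64 KB sample size in the usage comment are assumptions.

    ```perl
    use strict;
    use warnings;

    # Count the line-ending styles found in a chunk of text.
    # "\r\n" is tried first so that a CRLF pair is not counted
    # as a separate CR and LF.
    sub count_line_endings {
        my ($text) = @_;
        my %count = (crlf => 0, lf => 0, cr => 0);
        while ($text =~ /(\r\n|\n|\r)/g) {
            my $key = $1 eq "\r\n" ? 'crlf'
                    : $1 eq "\n"   ? 'lf'
                    :                'cr';
            $count{$key}++;
        }
        return %count;
    }

    # Read up to $bytes bytes from the start of a file with no newline
    # translation. A CRLF pair split at the end of the sample would be
    # counted as a lone CR.
    sub sample_file {
        my ($path, $bytes) = @_;
        open my $fh, '<:raw', $path
            or die "couldn't open '$path': $!";
        my $got = read($fh, my $buffer, $bytes);
        die "couldn't read '$path': $!" unless defined $got;
        close $fh;
        return $buffer;
    }

    # Usage (hypothetical path):
    # my %c = count_line_endings(sample_file('../dataset/proteins.txt', 65536));
    # print "$_: $c{$_}\n" for sort keys %c;
    ```

    A non-zero crlf or cr count would support the idea that chomp alone is not enough for this file.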