Search code examples
perlfilecsvseekfileparsing

How to optimally move lines with a specific pattern at the top in a huge file using Perl?


I have a huge csv file of nearly 20k rows with below format:

file,tools,edit,syntax,buffers
a,b,c,perl,d
a,w,c33,java,d
a,e,c,perl,d
a,s,c,python,d1
a,n,c,php,d3
d,r,hhh,cpp,d0
d,m,hhh,c#,d0
a,o,c,pdf,d3
a,f,c,python,dd
a,h,c,perl,dg
a,yb,c,c,ddf
a,b,c,perl,dt
wa,b,c33,java,d
d,buuu,hhh,cpp,d0
d44,b,hhh,nlp,d0
a,be,c,js,d4
wa,b,c33,java,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,c#,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,perl,d
wa,b,c33,php,d
wa,b,c33,java,d
wa,b,c33,python,d

I need to put 2 lines of pattern with same syntax (i.e 4th column) at the top. And then the rest of the lines will be there as is. That means first two lines with syntax as 'perl', then followed by 'java' , 'python' etc that way.

I have so far written this below code using seek and tell to make it optimized. However, it is not working as expected.

use strict;
use warnings;

open(FP, "+<mycsv.csv");

my %hash = ();
my $cur_pos;    


while(<FP>) {

    my $line = $_;
    chomp $line;
    #print "$line aaa\n";
    if($line =~ /^file\,tools,/) {next;}

    if($line =~ /^\w+\,\w+\,\w+,(\w+)\,.*$/) {
        my $type = $1;
        #print "type $type\n";

    if($hash{$type}->{count} < 2 ) {
        #print "--- here type = $type | lastpos = ", $hash{$type}->{lastpos} , "\n";
        $cur_pos = tell(FP);
        my $pos = tell(FP) - length($line); 
        if($hash{$type}->{lastpos} ) {

            my $lastpos = $hash{$type}->{lastpos};
            seek(FP, $lastpos, 1);
            print FP $line;
            seek(FP, $cur_pos, 1);
        } 

        $hash{$type}->{lastpos} = $pos;


    }
        if(exists $hash{$type} ) {
            $hash{$type}->{count} += 1;
        } else {
            $hash{$type}->{count} = 1;
        }


    }
}


close(FP);

The expected output should look like below:

 file,tools,edit,syntax,buffers
    a,b,c,perl,d
    a,e,c,perl,d
    a,w,c33,java,d
    wa,b,c33,java,d
    a,s,c,python,d1
    a,f,c,python,dd
    a,n,c,php,d3
    wa,b,c33,php,d
    d,r,hhh,cpp,d0
    d,buuu,hhh,cpp,d0
    d,m,hhh,c#,d0
    wa,b,c33,c#,d
    a,o,c,pdf,d3 
    a,yb,c,c,ddf 
    d44,b,hhh,nlp,d0
    a,be,c,js,d4  
    a,h,c,perl,dg   
    a,b,c,perl,dt   
    wa,b,c33,java,d
    wa,b,c33,python,d
    wa,b,c33,python,d
    wa,b,c33,python,d
    wa,b,c33,python,d
    wa,b,c33,php,d
    wa,b,c33,python,d
    wa,b,c33,perl,d
    wa,b,c33,php,d
    wa,b,c33,java,d
    wa,b,c33,python,d

Any help to make it work would be much appreciated.

Thanks.


Solution

  • I'd approach this by parsing the file to collect those first pairs of lines in a data structure and sending the other lines to a temp file. Once you've finished parsing the file, print out the pairs of lines from the data structure into your output file, and then add the temp file on to the end of the output file.

    sample code:

    use strict;
    use warnings;
    use feature ':5.16';
    
    my $infile = 'infile';
    my $outfile = 'outfile';
    my $tempfile = 'temp';
    my $quantity = 2;  # or whatever
    
    open my $in, '<', $infile or die 'Could not open infile: ' . $!;
    open my $out, '>', $outfile or die 'Could not create output file: ' . $!;
    open my $temp, '>', $tempfile or die 'Could not create tempfile: ' . $!;
    
    my $hash = {};
    my @order;
    my $hdr;
    
    while ( <$in> ) {
      if ( $hdr ) {
        my @cols = split ",", $_;
        my $key = $cols[3];
    
        # have we seen this key before?
        if ( ! $hash->{$key} ) {
          push @order, $key;
          $hash->{$key} = [ $_ ];
        }
        elsif ( scalar @{$hash->{$key}} < $quantity ) {
          push @{$hash->{$key}}, $_;
        }
        else {
          print { $temp } $_;
        }
      }
      else {
        # the header line
        print { $out } $_;
        $hdr = $_;
      }
    }
    
    # print the collected twofers out into the tempfile
    for my $key ( @order ) {
      print { $out } @{$hash->{$key}};
    }
    close $out;
    close $temp;
    
    # concatenate the files
    system join ' ', ( 'cat', $tempfile, '>>', $outfile );
    

    If the paired lines don't have to be in the order that they appear in the source file, you can skip the @order stuff.