I am trying to manipulate a file which has around 1 million rows. Below is my example input:
chr1 GeneA E1 -
chr1 GeneA E2 -
chr1 GeneA E3 -
chr1 GeneB E1 +
chr1 GeneB E2 +
chr1 GeneB E3 +
chr1 GeneB E4 +
chr1 GeneC E1 -
chr1 GeneC E2 -
chr2 GeneD E1 +
I want to reverse the order of column 3 if the column 4 has "-" sign and the rows have same column 1 and 2 values. Example output:
chr1 GeneA E1 - E3
chr1 GeneA E2 - E2
chr1 GeneA E3 - E1
chr1 GeneB E1 + E1
chr1 GeneB E2 + E2
chr1 GeneB E3 + E3
chr1 GeneB E4 + E4
chr1 GeneC E1 - E2
chr1 GeneC E2 - E1
chr2 GeneD E1 + E1
I am trying to code the following steps:
Take first row and store in arrayA.
Take the second row.
If column 1 and 2 has same value as in previous row and column 5 has "-" sign, then push second row in arrayA else print the whole arrayA with reverse column 3.
Here is what I tried so far:
#!/usr/bin/perl
open my $first, '<',$ARGV[0] or die "Unable to open input file: $!";
my @previous=split(/\t/,<$first>);
while (<$first>) {
my @current=split /\t/;
if ($current[1] eq $previous[1] && $current[0] eq $previous[0] && $current[3] eq "-"){
push @previous,[@current];
}
else{
foreach (@previous) {
print "$_","\t",reverse $previous[0][2];
}
@previous=@current;
}
}
It is giving out the same as input file. Could you please help to make this code work?
Always include use strict;
and use warnings;
at the top of EVERY script.
To do this project, you just need to keep a buffer of lines to later process once you see a change in your first two fields. This is a fairly common programming construct, especially when you're working with data that needs to be grouped and processed in some way:
use strict;
use warnings;
my @buffer;
while (<DATA>) {
chomp;
my @data = split ' ';
if (@buffer && ($data[0] ne $buffer[0][0] || $data[1] ne $buffer[0][1])) {
process_buffer(@buffer);
@buffer = ();
}
push @buffer, [@data, $_];
}
process_buffer(@buffer);
sub process_buffer {
my @buffer = @_;
my @col3 = map $_->[2], @buffer;
@col3 = reverse @col3 if $buffer[0][3] eq '-';
for my $i (0..$#buffer) {
print $buffer[$i][-1], " ", $col3[$i], "\n";
}
}
__DATA__
chr1 GeneA E1 -
chr1 GeneA E2 -
chr1 GeneA E3 -
chr1 GeneB E1 +
chr1 GeneB E2 +
chr1 GeneB E3 +
chr1 GeneB E4 +
chr1 GeneC E1 -
chr1 GeneC E2 -
chr2 GeneD E1 +
Outputs:
chr1 GeneA E1 - E3
chr1 GeneA E2 - E2
chr1 GeneA E3 - E1
chr1 GeneB E1 + E1
chr1 GeneB E2 + E2
chr1 GeneB E3 + E3
chr1 GeneB E4 + E4
chr1 GeneC E1 - E2
chr1 GeneC E2 - E1
chr2 GeneD E1 + E1