Search code examples
perl

Reverse the order of column based on other columns


I am trying to manipulate a file which has around 1 million rows. Below is my example input:

chr1  GeneA  E1  -
chr1  GeneA  E2  -
chr1  GeneA  E3  -
chr1  GeneB  E1  +
chr1  GeneB  E2  +
chr1  GeneB  E3  +
chr1  GeneB  E4  +
chr1  GeneC  E1  -
chr1  GeneC  E2  -
chr2  GeneD  E1  +

I want to reverse the order of column 3 if the column 4 has "-" sign and the rows have same column 1 and 2 values. Example output:

chr1  GeneA  E1  -  E3
chr1  GeneA  E2  -  E2
chr1  GeneA  E3  -  E1
chr1  GeneB  E1  +  E1
chr1  GeneB  E2  +  E2
chr1  GeneB  E3  +  E3
chr1  GeneB  E4  +  E4
chr1  GeneC  E1  -  E2
chr1  GeneC  E2  -  E1
chr2  GeneD  E1  +  E1

I am trying to code the following steps:

  1. Take first row and store in arrayA.

  2. Take the second row.

  3. If column 1 and 2 has same value as in previous row and column 5 has "-" sign, then push second row in arrayA else print the whole arrayA with reverse column 3.

Here is what I tried so far:

#!/usr/bin/perl
open my $first, '<',$ARGV[0] or die "Unable to open input file: $!";
my @previous=split(/\t/,<$first>);

while (<$first>) {
    my @current=split /\t/;
    if ($current[1] eq $previous[1] && $current[0] eq $previous[0] && $current[3] eq "-"){
        push @previous,[@current];
    }
    else{
        foreach (@previous) {
            print "$_","\t",reverse $previous[0][2];
        }
        @previous=@current;
    }
}

It is giving out the same as input file. Could you please help to make this code work?


Solution

  • Always include use strict; and use warnings; at the top of EVERY script.

    To do this project, you just need to keep a buffer of lines to later process once you see a change in your first two fields. This is a fairly common programming construct, especially when you're working with data that needs to be grouped and processed in some way:

    use strict;
    use warnings;
    
    my @buffer;
    
    while (<DATA>) {
        chomp;
        my @data = split ' ';
        if (@buffer && ($data[0] ne $buffer[0][0] || $data[1] ne $buffer[0][1])) {
            process_buffer(@buffer);
            @buffer = ();
        }
    
        push @buffer, [@data, $_];
    }
    
    process_buffer(@buffer);
    
    sub process_buffer {
        my @buffer = @_;
        my @col3 = map $_->[2], @buffer;
        @col3 = reverse @col3 if $buffer[0][3] eq '-';
        for my $i (0..$#buffer) {
            print $buffer[$i][-1], "  ", $col3[$i], "\n";
        }
    }
    
    __DATA__
    chr1  GeneA  E1  -
    chr1  GeneA  E2  -
    chr1  GeneA  E3  -
    chr1  GeneB  E1  +
    chr1  GeneB  E2  +
    chr1  GeneB  E3  +
    chr1  GeneB  E4  +
    chr1  GeneC  E1  -
    chr1  GeneC  E2  -
    chr2  GeneD  E1  +
    

    Outputs:

    chr1  GeneA  E1  -  E3
    chr1  GeneA  E2  -  E2
    chr1  GeneA  E3  -  E1
    chr1  GeneB  E1  +  E1
    chr1  GeneB  E2  +  E2
    chr1  GeneB  E3  +  E3
    chr1  GeneB  E4  +  E4
    chr1  GeneC  E1  -  E2
    chr1  GeneC  E2  -  E1
    chr2  GeneD  E1  +  E1