Search code examples
bashperlawkbioinformaticsdna-sequence

Reverse string in specific fields with condition


I have this file:

m64071_220512_054244/12584899/ccs rev pet047-10055 ACGTGCGACCTTGTGA TTGAGGGTTCAAACGTGCGACCTTGTGA
m64071_220512_054244/128321000/ccs rev pet047-10055 ACGTGCGACCTTGTGA TTGAGGGTTCAAACGTGCGACCTTGTGA
m64071_220512_054244/132186699/ccs fwd pet047-10055 TCACAAGGTCGCACGT TCACAAGGTCGCACGTTTGAACCCTCAA
m64071_220512_054244/134874748/ccs fwd pet047-10055 TCACAAGGTCGCACGT TCACAAGGTCGCACGTTTGAACCCTCAA

I need to tr and reverse the fields $4 and $5 only if $2==rev

Expect:

m64071_220512_054244/12584899/ccs rev pet047-10055 TCACAAGGTCGCACGT TCACAAGGTCGCACGTTTGAACCCTCAA
m64071_220512_054244/128321000/ccs rev pet047-10055 TCACAAGGTCGCACGT TCACAAGGTCGCACGTTTGAACCCTCAA
m64071_220512_054244/132186699/ccs fwd pet047-10055 TCACAAGGTCGCACGT TCACAAGGTCGCACGTTTGAACCCTCAA
m64071_220512_054244/134874748/ccs fwd pet047-10055 TCACAAGGTCGCACGT TCACAAGGTCGCACGTTTGAACCCTCAA

I tried:

perl -lpe 'if(/rev/) {$rev=/rev/;next}; if ($rev) {$F[4,5]=~tr/ATGC/TACG/; $F[4,5]=reverse $F[4,5]; print "@F"}' file 

I also tried to go with Awk (Execute bash command inside awk and print command output)

awk '{
            if($2==rev)
        {
            cmd1="echo \047" $4 "\047 | rev | tr \047ATGC\047 \047TACG\047" 
            cmd2="echo \047" $5 "\047 | rev | tr \047ATGC\047 \047TACG\047"
            newVar1=((cmd1 | getline line) > 0 ? line : "failed") 
            newVar2=((cmd2 | getline line) > 0 ? line : "failed")
            close(cmd)
            print $1, $2, $3, newVar1, newVar2
        }
        else {print}
}' file

Solution

  • To follow the question's attempt:

    perl -w -lanE'
        if ($F[1] eq "rev") { 
            for (@F[3,4]) { tr/ATGC/TACG/; $_ = reverse $_ } 
        } 
        say "@F"
    ' file
    

    Can put this on one line, I spread it out for readability (or copy-paste it as is into most shells). Or, put it in a program file of course, specially if there is more to do.

    The code in the block of that loop over @F[3,4] elements runs on each element of the array and modifies it in place (since $_ is just an alias to array elements), first with tr then by (reversing and) assigning. All that can also be written as

    $_ = reverse tr/ATGC/TACG/r  for @F[3,4];
    

    The r modifier makes tr return the changed string, which is then reverse-ed and then assigned back to the currently processed array element, via the $_ alias for it

    Comments on the code in the question

    • To have it break up the input string into the @F array ("autosplit") need -a flag

    • Since you are explicitly printing what is needed use -n flag, not -p

    • Fields 4 and 5 in the line are array elements 3 and 4

    • I assume that by $F[4,5] you mean the two array elements (which should be 3,4). That, then, should be @F[3,4] -- and with -w flag, for warnings, we'd hear about it

    • More importantly, we cannot bind a regular expression or a tr pattern to a list, but only to a single scalar. In order to apply that tr to multiple items need to iterate over them, like above


    With @ary =~ /.../ the LHS is first evaluated so we get something like

    $ary[0], ..., $ary[-1] =~ /.../;
    

    All terms up to the last one are evaluated in void context and discarded (see the comma operator for the logic of it), one by one and with a warning (if warnings are on!), and then $ary[-1] =~ /.../ is evaluated and its return returned from the whole expression.

    The same goes for LIST =~ /.../; (with parens around LIST or not)