perl shell bioinformatics fasta

perl sequence extraction loop


I have an existing Perl one-liner (from the Edwards lab) that works wonderfully. It reads a text file (ids.file) containing one column of IDs, searches a second text file in FASTA format (fasta.file in this example), and returns the sequences whose IDs appear in the first file. I was hoping to expand this script to do two additional things:

  1. The current one-liner only seems to work if ids.file contains a single column of data. I would like it to work on a file with two space-separated columns and act on the second column (really any column, but I assume an example using the second column will be easy enough to adapt).
  2. I would like to append any results returned by the search as a third column, instead of just writing them to a new file.

If someone is kind enough to offer an example but only has the time or inclination to work on one of these, I would prefer that you try to solve #2. I have come close to solving #1 with a for loop that uses awk so that the Perl code only sees the second column. I haven't gotten it working yet, but I am close, so #2 seems like the harder one to me.

The Perl one-liner is as follows:

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' ids.file fasta.file

I appreciate any help you can give!


Solution

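  • Not quite sure, but will this do? It keys each row of ids.file by its second column, then prints that row followed by the matching sequence on one line:

    perl -ne 'chomp; if (/^>(\S+)/) { print "\n" if $c; $_ = $c = $i{$1} } print if $c;
        $i{(/^\S*\s(\S*)$/)[0]} = "$_ " if @ARGV; END { print "\n" if $c }' \
      ids.file fasta.file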