Search code examples
bashawkgrepfasta

Pull out a range of data from unique character to unique character using grep or awk


I have a moderately large fasta format file that has a complex header. I need to pull a sequence out based on a value (an 8 digit number) from another file. I can get the sequence out using 'grep -20 "value" fasta.file'. Some of the sequences are very large and I often have to adjust the number of lines in order to get the whole sequence. I then have to copy and paste into another file. Right now, I have too many values (1000) to do this manually. The tools that I have found to do this haven't worked so far...

The fasta format file looks like this:

>transcript_cluster:RaGene-2_0-st-v1:17818557; Assembly=build-Rnor3.4/rn4; Seqname=chr6; Start=134300789; Stop=134300869; Strand=+; Length=80;
GGATCATTGATGACCATAAAAGATGTGGGAGTCGTCTGAAACATGCATGATGACCACAAC
ATTGAGAGTCTGAGGTCCAC
>transcript_cluster:RaGene-2_0-st-v1:17818559; Assembly=build-Rnor3.4/rn4; Seqname=chr6; Start=134301675; Stop=134301762; Strand=+; Length=87;
GGATCATTGATGACCAAAAAAAAAAAAACATCTGGGAGTCCTCTGAGACATCCATGATGA
CCACAACATTGGGAGTCTGAGGTCCAC

If I use the command grep -4 "17818557" fasta.fa I get:

ATTGCGAGTCTGAGGTCCAC
>transcript_cluster:RaGene-2_0-st-v1:17818555; Assembly=build-Rnor3.4/rn4; Seqname=chr6; Start=134299894; Stop=134299978; Strand=+; Length=84;
GGATCATTGATGACCAGAAAAAAATCATCTCGGAGTCCTCTGAGACATCCATGATGACCA
CAACATTGGGAGTCTGAGGTCCAC
>transcript_cluster:RaGene-2_0-st-v1:17818557; Assembly=build-Rnor3.4/rn4; Seqname=chr6; Start=134300789; Stop=134300869; Strand=+; Length=80;
GGATCATTGATGACCATAAAAGATGTGGGAGTCGTCTGAAACATGCATGATGACCACAAC
ATTGAGAGTCTGAGGTCCAC
>transcript_cluster:RaGene-2_0-st-v1:17818559; Assembly=build-Rnor3.4/rn4; Seqname=chr6; Start=134301675; Stop=134301762; Strand=+; Length=87;
GGATCATTGATGACCAAAAAAAAAAAAACATCTGGGAGTCCTCTGAGACATCCATGATGA

grep -4 grabs four lines above and below. What I need to do is use the numerical query and pull out just the sequence data below the fasta header (>). It would be nice to collect the sequence below the fasta header to next fasta header, ie from > to >.

I have tried some of the UCSC tools 'faSomeRecord' and some perl scripts. They haven't worked with the numerical query in list file or on the command line with and without the 'transcript_cluster:RaGene-2_0-st-v1:' addition. I am thinking that it is the colons or because the header includes positions and lengths which are variable.

Any comments or help is greatly appreciated!

EDIT 30July14

Thanks to the help I received here. I was able to get the data from one file to another using this bash script:

#!/usr/bin/bash

filename='21Feb14_list.txt'
filelines=`cat $filename`

for i in $filelines ; do

        awk '/transcript/ && f==1 {f=0;next} /'"$i"'/ {f=1} f==1{print $1}' RaGene-2_0-st-v1.rn4.transcript_cluster.fa

done

This pulls out the sequence, but it truncates the data to the wildcard value. Is there a way to modify this so that I can get the entire header?

example output:

>transcript_cluster:RaGene-2_0-st-v1:17719499;
ATGCCTGAGCCTTCGAAATCTGCACCAGCTCCTAAGAAGGGCTCTAAGAAAGCTATCTCT
AAAGCTCAGAAAAAGGATGGCAAGAAGCGCAAGCGTAGCCGCAAGGAGAGCTATTCCGTG
TACGTGTACAAGGTGCTGAAGCAAGTGCACCCGGACACCGGCATCTCTTCCAAGGCCATG
GGCATCATGAACTCGTTCGTGAACGACATCTTCGAGCGCATCGCGGGCGAGGCGTCGCGC
CTGGCGCATTACAACAAGCGCTCGACCATCACGTCCCGGGAGATCCAGACCGCCGTGCGC
CTGCTGCTGCCGGGGGAGCTGGCCAAGCACGCGGTGTCGGAAGGCACCAAGGCGGTCACC
AAGTACACCAGCTCCAAGTG
>transcript_cluster:RaGene-2_0-st-v1:17623679;

Thanks again!!


Solution

  • $ awk '/transcript/ {f=0} /17818557/ {f=1} f==1{print}' fasta
    >transcript_cluster:RaGene-2_0-st-v1:17818557; Assembly=build-Rnor3.4/rn4; Seqname=chr6; Start=134300789; Stop=134300869; Strand=+; Length=80;
    GGATCATTGATGACCATAAAAGATGTGGGAGTCGTCTGAAACATGCATGATGACCACAAC
    ATTGAGAGTCTGAGGTCCAC
    

    How it works:

    The code uses a flag, called f, to decide if a line should be printed. Taking each command, one by one:

    • /transcript/ {f=0}

      If "transcript" appears in the line, indicating a header, we set the flag to 0.

    • /17818557/ {f=1}

      If the line contains our key, 17818557, we set the flag to 1

    • f==1{print}

      If the flag is 1, print the line.