bash awk sed grep fasta

bash: searching for a string in a file and returning all the matching positions


I have a FASTA file (think of it as a text file in which odd lines are sequence IDs and even lines are sequences of characters). I would like to search for a string in the sequences and get the positions of the matching substrings as well as their IDs. Example input:

>111
AACCTTGG
>222
CTTCCAACC
>333
AATCG

Searching for "CC" should output:

3 111
4 8 222

Solution

  • $ awk -F'CC' 'NR%2==1{id=substr($0,2);next} NF>1{x=1+length($1); b=x; for (i=2;i<NF;i++){x+=length(FS $i); b=b " " x}; print b,id}' file
    3 111
    4 8 222
    

    Explanation:

    • -F'CC'

      awk breaks input lines into fields. We instruct it to use the sequence of interest, CC in this example, as the field separator.

    • NR%2==1{id=substr($0,2);next}

      On odd-numbered lines, we save the id to the variable id. The assumption is that the first character is > and the id is whatever comes after it. Having captured the id, we instruct awk to skip the remaining commands and start over with the next line.

    • NF>1{x=1+length($1); b=x; for (i=2;i<NF;i++){x+=length(FS $i); b=b " " x}; print b,id}

      If awk finds only one field on an input line (NF==1), no field separator was found, so the line contains no match. The condition NF>1 skips those lines.

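      For the rest of the lines, x tracks the 1-based position of each match. The first match starts right after the first field, at 1+length($1). Each later match starts length(FS $i) characters further on, that is, the length of the separator plus the length of field i. The loop stops at i=NF-1 because the last field comes after the final match. Each value of x is appended to the string b.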

      Finally, we print the match locations, b, and the id.
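
The position bookkeeping described above can also be written as an explicit search loop. The following is a minimal sketch, not the answer's method: it swaps field splitting for awk's index() function, which searches for the string literally (a multi-character field separator is treated as a regular expression instead) and, as written here, also reports overlapping matches. The function name find_positions and the variable names pat, rest, off and out are hypothetical, and the file is still assumed to alternate id lines and sequence lines.

```shell
#!/bin/sh
# Hypothetical helper: find_positions PATTERN FILE
# Assumes each record is exactly two lines: ">id" followed by the sequence.
find_positions() {
    awk -v pat="$1" '
        # Odd line: drop the leading ">" and remember the id.
        NR % 2 == 1 { id = substr($0, 2); next }

        # Even line: walk through the sequence one match at a time.
        {
            out = ""; off = 0; rest = $0
            # index() returns the 1-based position of pat in rest, or 0 if absent.
            while (pat != "" && (p = index(rest, pat)) > 0) {
                off += p                              # position in the full sequence
                out = (out == "" ? off : out " " off)
                rest = substr(rest, p + 1)            # resume one character later
            }
            if (out != "") print out, id
        }
    ' "$2"
}
```

Run as find_positions CC file on the example input, this prints the same two lines as the one-liner: 3 111 and 4 8 222. To report only non-overlapping matches, advance by the pattern length instead: replace the two lines that update off and rest with rest = substr(rest, p + length(pat)) and off += p + length(pat) - 1, and record off - length(pat) + 1 as the position.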