I have a fasta file_imagine as a txt file in which even lines are sequences of characters and odd lines are sequence id's_ I would like to search for a string in sequences and get the position for matching substrings as well as their ids. Example: Input:
>111
AACCTTGG
>222
CTTCCAACC
>333
AATCG
search for "CC" . output:
3 111
4 8 222
$ awk -F'CC' 'NR%2==1{id=substr($0,2);next} NF>1{x=1+length($1); b=x; for (i=2;i<NF;i++){x+=length(FS $i); b=b " " x}; print b,id}' file
3 111
4 8 222
Explanation:
-F'CC'
awk breaks input lines into fields. We instruct it to use the sequence of interest, CC
in this example, as the field separator.
NR%2==1{id=substr($0,2);next}
On odd number lines, we save the id to variable id
. The assumption is that the first character is >
and the id is whatever comes after. Having captured the id, we instruct awk to skip the remaining commands and start over with the next
line.
NF>1{x=1+length($1); b=x; for (i=2;i<NF;i++){x+=length(FS $i); b=b " " x}; print b,id}
If awk finds only one field on an input line, NF==1
, that means that there were no field separators found and we ignore those lines.
For the rest of the lines, we calculate the positions of each match in x
and then save each value of x
found in the string b
.
Finally, we print the match locations, b
, and the id
.