awk header fasta

Using awk to append an occurrence count to each line that starts with ">"


I have been struggling to find a way, using awk, to find identical patterns and append a tag to each one showing which occurrence it is in the file. For example, if Spiroplasma_culicicola occurs 7 times, the first occurrence should become Spiroplasma_culicicola_1, the second Spiroplasma_culicicola_2, the third Spiroplasma_culicicola_3, and so on.

My input is a FASTA file that looks like this:

>Spiroplasma_taiwanense
GKGVKYKNEKIIRKEGKAAGKMTTDVIADMLTRIRNANQRFHKEVVIPGSKVKLEIANIL
KKEGFIEDFKVADDFKKDITISLKYRGKTRVIKGLKRISKPGLRVYSHATEIPQVLNGLG
IAIVSTSHGIMTDKEARQQNAGGEVLAFVW
>Spiroplasma_diminutum
NRLEKQYKEKIVPELFKEKQYKSIMQVPKITKVVINMGIGDAVQDTKKLDDAVLELQQIT
GQKPLVTKAKKSLAVFKLREGMPIGAKVTLRGKRMYEFLDKLISVALPRVRDFRGVPKTS
FDKQGNYTMGIKEQIIFPEIDYDKVKKVRGMDITIVTTANQKDEAFSLLQKMGMPFVKMN
KSKILRGDVVKVIAGSHKGKIGPVVKLSKDKKRVYVEGIVAIK-HAKPSQTDQEGGIREI
PAGVDISNVSLVDPKVKDSATRVGYKIADGKKVRIAKKSGSEVK-MIQNESRLKVADNSG
>Spiroplasma_diminutum
NRLEKQYKEKIVPELFKEKQYKSIMQVPKITKVVINMGIGDAVQDTKKLDDAVLELQQIT
GQKPLVTKAKKSLAVFKLREGMPIGAKVTLRGKRMYEFLDKLISVALPRVRDFRGVPKTS
FDKQGNYTMGIKEQIIFPEIDYDKVKKVRGMDITIVTTANQKDEAFSLLQKMGMPFVKMN
...

I would like to add the "tag" (the occurrence number) only to the header lines, so the file above should become:

>Spiroplasma_taiwanense_1
GKGVKYKNEKIIRKEGKAAGKMTTDVIADMLTRIRNANQRFHKEVVIPGSKVKLEIANIL
KKEGFIEDFKVADDFKKDITISLKYRGKTRVIKGLKRISKPGLRVYSHATEIPQVLNGLG
IAIVSTSHGIMTDKEARQQNAGGEVLAFVW
>Spiroplasma_diminutum_1
NRLEKQYKEKIVPELFKEKQYKSIMQVPKITKVVINMGIGDAVQDTKKLDDAVLELQQIT
GQKPLVTKAKKSLAVFKLREGMPIGAKVTLRGKRMYEFLDKLISVALPRVRDFRGVPKTS
FDKQGNYTMGIKEQIIFPEIDYDKVKKVRGMDITIVTTANQKDEAFSLLQKMGMPFVKMN
KSKILRGDVVKVIAGSHKGKIGPVVKLSKDKKRVYVEGIVAIK-HAKPSQTDQEGGIREI
PAGVDISNVSLVDPKVKDSATRVGYKIADGKKVRIAKKSGSEVK-MIQNESRLKVADNSG
>Spiroplasma_diminutum_2
NRLEKQYKEKIVPELFKEKQYKSIMQVPKITKVVINMGIGDAVQDTKKLDDAVLELQQIT
GQKPLVTKAKKSLAVFKLREGMPIGAKVTLRGKRMYEFLDKLISVALPRVRDFRGVPKTS
FDKQGNYTMGIKEQIIFPEIDYDKVKKVRGMDITIVTTANQKDEAFSLLQKMGMPFVKMN
...

Based on a previously answered question, I figured that I should use awk, with something like this: awk '$1 ~ /^>/ {gsub(" ", "", $0); a[$0]++; print $0"_"a[$0]}'

(code stolen from here: find the number of occurences and add it next to the pattern)

However, I can't find a way to save the changes in the file (as sed does with -i), and I can't redirect the output to a new file because then only the headers are printed and saved.

Any ideas?

thanks P


Solution

  • It seems the problem is that you don't understand the code you have found elsewhere:

    awk '$1 ~ /^>/ {gsub(" ", "", $0); a[$0]++; print $0"_"a[$0]}'
    

    By the looks of things, it performs the substitution that you want but prints only the lines that start with >.

    So the missing part is to print the rest of the lines without making any modification.

    You could do it like this:

    awk '$1 ~ /^>/ { gsub(" ", "", $0); a[$0]++; $0 = $0"_"a[$0] } { print }'
    

    That is, change the print to an assignment in the first block and add an unconditional second block which always prints everything.
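
    As a rough illustration, here is the same two-block program written out over several lines with comments, run against a small made-up input. The file names sample.fasta and tagged.fasta and the shortened sequence lines are assumptions for the sketch, not part of the question.

    ```shell
    # Hypothetical sample input; sequence lines are shortened for illustration.
    printf '%s\n' \
      '>Spiroplasma_taiwanense' 'GKGVKYKNEK' \
      '>Spiroplasma_diminutum' 'NRLEKQYKEK' \
      '>Spiroplasma_diminutum' 'NRLEKQYKEK' > sample.fasta

    awk '
      # Header lines: strip spaces, count this header, append the count.
      $1 ~ /^>/ {
        gsub(" ", "", $0)
        a[$0]++
        $0 = $0 "_" a[$0]
      }
      # Every line, modified or not, is printed.
      { print }
    ' sample.fasta > tagged.fasta

    cat tagged.fasta
    ```

    Because the second block has no condition, the sequence lines pass through unchanged, so redirecting to a new file now keeps the whole record rather than only the headers.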

    The code can be further simplified by combining the increment with the assignment and by changing { print } to the common shorthand: a bare 1 as the condition, which is always true and so triggers the default action, print.

    As mentioned in the comments, the call to gsub can be improved by passing a regex literal as the first argument instead of a string, which must be converted to a regex before use. It can also be shortened by dropping the final argument, $0, which is the default target.

    awk '$1 ~ /^>/ { gsub(/ /, ""); $0 = $0 "_" ++a[$0] } 1'
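
    One way to convince yourself that the shortened form behaves like the two-block version is to run both on the same input and compare the results. This is only a sketch; check.fasta, long.out and short.out are hypothetical file names, and the records are placeholders.

    ```shell
    # Hypothetical input with one repeated header.
    printf '%s\n' '>A' 'SEQ1' '>B' 'SEQ2' '>A' 'SEQ3' > check.fasta

    # Two-block version with an explicit print.
    awk '$1 ~ /^>/ { gsub(" ", "", $0); a[$0]++; $0 = $0"_"a[$0] } { print }' check.fasta > long.out

    # Shortened version: regex literal, pre-increment, and the bare 1 pattern.
    awk '$1 ~ /^>/ { gsub(/ /, ""); $0 = $0 "_" ++a[$0] } 1' check.fasta > short.out

    # cmp is silent and exits 0 when the two files are byte-for-byte equal.
    cmp long.out short.out && echo "identical output"
    ```

    The pre-increment ++a[$0] bumps the counter before its value is concatenated, which is why the first occurrence is tagged _1 rather than _0.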
    

    To overwrite the original file, redirect the output to a temporary file, then move it over the original:

    awk '...' input > tmp && mv tmp input
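
    A slightly more defensive sketch of the same idea uses mktemp so that the temporary name cannot clash with an existing file. The file name genes.fasta and its contents are made up for the example.

    ```shell
    input=genes.fasta   # hypothetical file name
    printf '%s\n' '>A' 'SEQ1' '>A' 'SEQ2' > "$input"

    # Create a uniquely named temporary file; stop if that fails.
    tmp=$(mktemp) || exit 1

    # Replace the original only if awk succeeded (&& short-circuits on failure).
    awk '$1 ~ /^>/ { gsub(/ /, ""); $0 = $0 "_" ++a[$0] } 1' "$input" > "$tmp" &&
      mv "$tmp" "$input"

    cat "$input"
    ```

    Note that after the mv, the file carries the permissions of the temporary file (mktemp creates it readable and writable by the owner only), so you may need to restore the original mode with chmod.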
    

    Or with GNU awk (version 4.1.0 or later, which ships the inplace extension), as mentioned in the comments:

    awk -i inplace '...' input