bash bioinformatics fasta

append modified ID to fasta file ID


I have a file that looks like this:

>1_CCACT_1/1
CCATCATTGGCGTCTACA
>2_ATATC_1/1
ATATGAAGGCTGTGAAGCAAAGCGTC
>3_GCTAT_1/1
CAAACCCATTAATTTCACATCCGTCC
>4_GTATG_1/1
TAAGCCAGGTTGGTTTCTATCTTT
>5_ATATC_1/1
ATATGAAGGCTGTGAAGCAAAGCGTC

And I want to make it look like this:

>1_CCACT_1/1#1#CCACT
CCATCATTGGCGTCTACA
>2_ATATC_1/1#1#ATATC
ATATGAAGGCTGTGAAGCAAAGCGTC
>3_GCTAT_1/1#1#GCTAT
CAAACCCATTAATTTCACATCCGTCC
>4_GTATG_1/1#1#GTATG
TAAGCCAGGTTGGTTTCTATCTTT
>5_ATATC_1/1#1#ATATC
ATATGAAGGCTGTGAAGCAAAGCGTC

The original header is kept unchanged and followed by a #, then by the number that comes after the second underscore (the trailing 1 here):

5_ATATC_1

That is followed by another # and then the barcode, the letters between the two underscores (ATATC here):

5_ATATC_1

I'm using the last entry just as an example. I have some messy sed scripts that can produce the desired header (sort of) but I can't figure out how to append them back to the original headers. You can't assume that the second number will always be a 1, but you can assume that the order of the file won't change. Open to solutions in any programming language, though I've only tried in bash.


Solution

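  • A couple of sed ideas using capture groups (\1 = leading number, \2 = barcode, \3 = number after the second underscore, \4 = rest of the header):

    # & re-inserts the whole match, then #\3#\2 is appended
    sed -E 's/>([^_]*)_([^_]*)_([0-9]*)(.*)/&#\3#\2/'           fasta.dat
    # same result, rebuilding the header from the individual groups
    sed -E 's/>([^_]*)_([^_]*)_([0-9]*)(.*)/>\1_\2_\3\4#\3#\2/' fasta.dat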
    

    Both of these generate:

    >1_CCACT_1/1#1#CCACT
    CCATCATTGGCGTCTACA
    >2_ATATC_1/1#1#ATATC
    ATATGAAGGCTGTGAAGCAAAGCGTC
    >3_GCTAT_1/1#1#GCTAT
    CAAACCCATTAATTTCACATCCGTCC
    >4_GTATG_1/1#1#GTATG
    TAAGCCAGGTTGGTTTCTATCTTT
    >5_ATATC_1/1#1#ATATC
    ATATGAAGGCTGTGAAGCAAAGCGTC
    

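    Once satisfied with the result, add the -i flag to overwrite the input file (-i.bak keeps a copy of the original as fasta.dat.bak). Run only one of the two commands; running both would append the suffix twice and overwrite the backup: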

    sed -E -i.bak 's/>([^_]*)_([^_]*)_([0-9]*)(.*)/&#\3#\2/'           fasta.dat
    sed -E -i.bak 's/>([^_]*)_([^_]*)_([0-9]*)(.*)/>\1_\2_\3\4#\3#\2/' fasta.dat
    
    $ cat fasta.dat
    >1_CCACT_1/1#1#CCACT
    CCATCATTGGCGTCTACA
    >2_ATATC_1/1#1#ATATC
    ATATGAAGGCTGTGAAGCAAAGCGTC
    >3_GCTAT_1/1#1#GCTAT
    CAAACCCATTAATTTCACATCCGTCC
    >4_GTATG_1/1#1#GTATG
    TAAGCCAGGTTGGTTTCTATCTTT
    >5_ATATC_1/1#1#ATATC
    ATATGAAGGCTGTGAAGCAAAGCGTC
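
For comparison, the same header rewrite can be sketched in awk, which splits the header on underscores instead of using capture groups. This is not part of the original answer: the function name annotate_headers and the sample input are made up for illustration, and the sketch assumes every header has the form >N_BARCODE_M... with exactly two underscores before the number.

```shell
# annotate_headers is a hypothetical helper name used only for this sketch.
# It reads FASTA from the given files (or stdin) and prints the result to stdout.
annotate_headers() {
    awk '
        /^>/ {
            # Split the header on "_": parts[2] is the barcode,
            # parts[3] starts with the number we want (e.g. "2/1").
            split($0, parts, "_")
            num = parts[3]
            sub(/[^0-9].*/, "", num)      # keep only the leading digits
            $0 = $0 "#" num "#" parts[2]  # append to the unchanged header
        }
        { print }                         # sequence lines pass through as-is
    ' "$@"
}

# Usage with made-up input where the second number is not 1:
printf '%s\n' '>5_ATATC_2/1' 'ATATGAAGG' | annotate_headers
```

Unlike sed -i, this writes to stdout, so redirect it to a new file rather than back onto the input.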