Writing a single sequence from a file in FASTA format


Given a file in FASTA format (your.file), for example:

>Code1234_length1
ABCEDLKSDJFABCEDLKSDJFABCEDLKSDJFABCEDLKSDJFABCEDLKSDJF
>Code1335_length2
AJDHIEUNAJDHIEUNAJDHIEUNAJDHIEUNAJDHIEUNAJDHIEUN

In practice the content after >Code1234_length1 is unknown (it is shown here only to make the sample reproducible). I would like to extract everything from the >Code1234_length1 header line up to, but not including, the next > line, and write it to a new file, i.e.:

>Code1234_length1
ABCEDLKSDJFABCEDLKSDJFABCEDLKSDJFABCEDLKSDJFABCEDLKSDJF

How could this be done? Thank you.


Solution

  • If awk is an option, you could try:

    awk '
        /^>Code1234_length1/ {f = 1; print; next}   # if the keyword is found, set the flag,
                                                    #    print the line and continue with the next line
        f {                                         # if the flag is set
            if (/^>/) f = 0                         #    if next ">" is found, reset the flag
            else print                              #    otherwise print the line
        }
    ' your.file > new.file
    

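    It works even if the sequence after the >Code1234_length1 header spans multiple lines. Note that the pattern is a prefix match, so a header such as >Code1234_length10 would also be selected.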
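
If the same extraction is needed for different IDs, the flag logic above can be wrapped in a small shell function. The following is only a sketch under a few assumptions: the function name extract_record is hypothetical (it is not part of the original answer), the ID is taken to be the first whitespace-delimited word of the header line, and only the first record with that ID is wanted, so awk stops reading as soon as that record ends.

```shell
# extract_record ID FILE
# Hypothetical helper: prints the first FASTA record whose header starts with
# the word ">ID" (exact match on that word, not a prefix match).
extract_record() {
    awk -v id="$1" '
        /^>/ {
            if (f) exit              # wanted record has ended: stop reading the file
            f = ($1 == ">" id)       # set the flag only on an exact ID match
        }
        f                            # print every line while the flag is set
    ' "$2"
}
```

Called as extract_record Code1234_length1 your.file > new.file, this should produce the same output as the command above for the sample input, while leaving a record named >Code1234_length10 alone. Because the ID is passed with -v, awk interprets backslash escapes in it, so IDs containing backslashes would need extra care.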