Tags: regex, shell, parsing, fasta

Turning multi-fasta file into set of single-line sequences


I have a multi-fasta sequence file (there is a newline character at the end of each line):

>M3559
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA
TTTGG:TAT:TTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACG
CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
>M9171
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA
TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG
CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTACCTC
>M4692
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA
TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG
CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC

I would like to turn it into a file where each sequence is on a single line, with the sequence name followed by a tab:

>M3559 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCATTTGG:TAT:TTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
>M9171 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCATTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTACCTC
>M4692 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCATTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC

I got to the point where I removed all the newline characters with a simple:

awk 1 ORS='' test.txt

But I now need to place a newline character at the beginning of each sequence name (so substitute > with \n>)

tr ">" "\n"

(although this removes the >, and ideally I would like to keep it, but it's not a big deal)

and add a \t after the sequence name, which I can capture with a regular expression.

^>M[0-9]{4}

And it is this last bit I struggle with: how do I add a character after a regex-matched string in a file? Suggestions will be greatly appreciated :-)

yot

UPDATE: below I paste the output of the various commands suggested by others on my test input file.

UPDATE 2: Fredrik's answer works if you use GNU sed instead of the default sed on a Mac. Please see my comment under Fredrik's answer.

Running:

awk -v RS='\n>' -v ORS='\n>' -v OFS='' -F'\n' '{$1=$1 "\t"}1' file

on my input produces:

>M3559
>GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA
>TTTGG:TAT:TTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACG
>CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
>ATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTA
>CTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTG
>>M9171
>GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA
>TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG
>CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTACCTC
>ATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAGCATACTTA
>CTAAAGTGTGTTAGTTAATTAATGCTTGTAGGACATAATAATAACAATTG
>>M4692
>GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA
>TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG
>CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
>ATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTA
>CCAAAATGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTG

Running:

echo $(cat test.txt) | sed 's/>/\n>/2g' | sed 's/ //2g' | sed 's/ /\t/g'

produces nothing (no output).

I am not running paste -d " " - - - - < input because the number of lines per sequence differs in my input.

But running:

awk 'NR%4{printf $0" ";next;}1' input

Produces:

>M3559 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
ATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTA CTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTG 
>M9171 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA
TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTACCTC ATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAGCATACTTA CTAAAGTGTGTTAGTTAATTAATGCTTGTAGGACATAATAATAACAATTG
>M4692 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
ATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTA CCAAAATGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTG

and then running sed 's/ \+/ /' | tr -d ' ' does not help...


Solution

  • If the input is as well formatted as above (every record spans exactly four lines), you can use paste:

    $ paste -d " " - - - - < input
    >M3559 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
    >M9171 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTACCTC
    >M4692 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
    

    or awk:

    $ awk 'NR%4{printf "%s ", $0; next}1' input
    >M3559 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
    >M9171 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTACCTC
    >M4692 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCT:CCATGCA TTTGG:TAT:TTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACG CTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC
    

    To remove the remaining spaces and put a tab after the id, pipe everything to the following (GNU sed is needed, since \+ and \t are GNU extensions):

    sed 's/ \+/\t/' | tr -d ' '
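
When the number of lines per record varies, as the asker reports for the real input, grouping by a fixed count of four lines no longer works. A minimal sketch of an alternative, assuming every header line starts with > and everything else is sequence data; the file names sample.fa and joined.tsv and the shortened sequences are made up for illustration:

```shell
# Build a small test file with uneven record lengths (hypothetical sample data).
printf '%s\n' '>M3559' 'GATCAC' 'TTTGG:' \
              '>M9171' 'GATCAC' \
              '>M4692' 'GATC' 'TT' 'AA' 'CC' > sample.fa

# Header lines start a new output line and are followed by a tab;
# sequence lines are appended to the current line with no separator.
awk '
  /^>/ { if (NR > 1) printf "\n"; printf "%s\t", $0; next }
       { printf "%s", $0 }
  END  { if (NR > 0) printf "\n" }
' sample.fa > joined.tsv

cat joined.tsv
```

Because the tab is written by awk's printf rather than by a sed replacement, this should behave the same with GNU and BSD (macOS) tools.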