Search code examples
bashsedfasta

how can I reformat sequences (several lines) in a fasta file to single line?


Input "file.fasta" (note, this is a sample .... in fasta file, the sequences may have more than three lines)

>chr1:117223140-117223856 TAG:GTGGG
GTGGgggggcgCATAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGAGtt
aGTAGTATCGAATCGCACGACTGACAGCTCAGCATCAGCGACGACTAGTG
GTGGGCGACGACAgCGATATA
>chr2:117223140-117223856 TAG:GGGCT
ACGAGCAGCAGCAGCAGCagCCGATCGACGACTCAAGTACGATACGCGaa
cCCCCCGACGACGACTCACGA

Expected output

>chr1:117223140-117223856 TAG:GTGGG
GTGGgggggcgCATAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGAGttaGTAGTATCGAATCGCACGACTGACAGCTCAGCATCAGCGACGACTAGTGGTGGGCGACGACAgCGATATA
>chr2:117223140-117223856 TAG:GGGCT
ACGAGCAGCAGCAGCAGCagCCGATCGACGACTCAAGTACGATACGCGaacCCCCCGACGACGACTCACGA

my effort: sed command

sed ':a;N;$!ba;s/\([actgACGT]\)\n\([actgACGT]\)/\1\2/g' file.fasta

my wrong output:

>chr1:117223140-117223856 TAG:GTGGGGTGGgggggcgCATAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGAGttaGTAGTATCGAATCGCACGACTGACAGCTCAGCATCAGCGACGACTAGTGGTGGGCGACGACAgCGATATA
>chr2:117223140-117223856 TAG:GGGCTACGAGCAGCAGCAGCAGCagCCGATCGACGACTCAAGTACGATACGCGaacCCCCCGACGACGACTCACGA

The regular expression for header (lines whose first letter is ">") is "^>.*$", but I do not know how to include in sed command

thanks in advance


Solution

  • This might work for you (GNU sed):

    sed ':a;N;/>/!s/\n//;ta;P;D' file
    

    Look at two lines and if either does not contains a > delete the newline between them and repeat. If either of the lines does contain a > then print and delete the first of them and then repeat.