Search code examples
pythonlinuxsedawkfasta

Merge two lines generated from contigs.fa


I have a file generated by assemblers. It looks like following.

>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAAC
CAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTAT
ACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGA
ACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTAT
TCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTG
TCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGT
CTTCC

I want to merge the lines using python or linux sed command and want result in this way.

>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAACCAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTATACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGAACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTATTCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTGTCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGTCTTCC

like every seqeunce consider as single line and Node name as other line.


Solution

  • You can use awk to do the job:

    awk < input_file '/^>/ {print ""; print; next} {printf "%s", $0} END {print ""}'
    

    This only starts one process (awk). Only drawback: it adds an empty first line. You can avoid such things by adding a state variable (the code belongs on one line, it's just to make it better readable):

    awk < input_file '/^>/ { if (flag) print ""; print; flag=0; next }
        { printf "%s", $0; flag=1 } END { if (flag) print "" }'
    

    @how to store it in a new file:

    awk < input_file > output_file '/^>/ { .... }'