Search code examples
regexawksedfastareformat

How can i remove a line only if it is followed by a line that starts with the same character?


I need some help with sed or awks.

How can i remove a line only if it is followed by a line that starts with the same character (in this case >)?

Example I have this:

>1_SRR1422294
ATCGTCAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAT
>2_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>5_SRR1422298
>5_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>6_SRR1422294
>6_SRR1422250
TGTTCATGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
>9_SRR1422294
GCGACTAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC

I want to get this:

>1_SRR1422294
ATCGTCAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAT
>2_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>5_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>6_SRR1422250
TGTTCATGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
>9_SRR1422294
GCGACTAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC

Note that not all the lines have the same numbers but they all have the same format, which is why I want to use regular expressions. If you could explain how to read the code you produce that would be really great.

Thank you so much!


Solution

  • If the whole file follows that pattern (some number of lines starting with >, of which you want only the last, followed by a single line that should always be printed), you could use something like this:

    awk '/^>/ { latest=$0 } !/^>/ { if (latest) { print latest; latest="" } print }'
    

    If the line starts with >, then it is remembered (stored in the variable latest) but not printed. If the line doesn't start with >, then it is printed, but only after first printing whatever was most recently stored in latest.

    The conditional means each printed > line will appear only once, even if there are multiple non-> lines in a row. Since that doesn't happen in your sample data, you may not need the complication, and could use this simpler unconditional version:

    awk '/^>/ { latest=$0 } !/^>/ { print latest; print }'