Delete lines shorter than a certain length and the one above it (remove short sequences in a FASTA file)


I have a file containing the following text:

>seq1
GAAAT
>seq2
CATCTCGGGA
>seq3
GAC
>seq4
ATTCCGTGCC

If a line that doesn't start with ">" is 5 characters or shorter, I want to delete it and the line right above it.

Expected output:

>seq2
CATCTCGGGA
>seq4
ATTCCGTGCC

I have tried sed -r '/^.{,5}$/d', but it also deletes the lines with ">".


Solution

  • With GNU sed, you can use

    sed -E '/>/N;/\n[^>].{0,4}$/d'
    

    Details:

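    • />/ - matches lines containing > (if it must be at the start of the line, add ^ before >)
    • N - reads the next input line and appends it to the pattern space, separated from the existing content by a newline
    • \n[^>].{0,4}$ - a newline, a char other than > (as the first char of the sequence line should not be >), and then zero to four more chars up to the end of the string, so sequence lines of one to five chars match
    • d - deletes the pattern space, that is, the header line and the sequence line together.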
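
To see what N leaves in the pattern space, the l command can print it in an unambiguous form. This is only a small sketch on one record from the question, assuming GNU sed; the function name show_pattern_space is made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper: for every line containing ">", append the next line (N)
# and print the resulting pattern space with l.
# -n suppresses the normal output, so only the l output is shown.
# l displays the embedded newline as \n and marks the end of the buffer with $.
show_pattern_space() {
  sed -n '/>/{N;l;}'
}

printf '>seq3\nGAC\n' | show_pattern_space
# prints: >seq3\nGAC$
```

The \n in that output is the newline that the regex \n[^>].{0,4}$ anchors on, which is why the length check applies only to the sequence line and never to the header.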

    See the online demo:

    #!/bin/bash
    s='>seq1
    GAAAT
    >seq2
    CATCTCGGGA
    >seq3
    GAC
    >seq4
    ATTCCGTGCC'
    sed -E '/>/N;/\n[^>].{0,4}$/d' <<< "$s"
    

    Output:

    >seq2
    CATCTCGGGA
    >seq4
    ATTCCGTGCC