Extract lines containing two patterns

I have a file which contains several lines as follows:

>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
>header3
<pattern_1>ATGGCCACCAACAACCAGAGCTCCC
>header4
GACCGGCACGTACAACCTCCAGGAAATCGTGCCCGGCAGCGTGTGGATGGAGAGGGACGTG
>header5
TGCCCCCACGACCGGCACGTACAAC<pattern_2>

I want to extract all lines containing both and including the header lines.

I have tried using grep, but it only extracts the sequence lines but not the header lines.

grep <pattern_1> | grep <pattern_2> input.fasta > output.fasta

How to extract lines containing both the patterns and the headers in Linux? The patterns can be present anywhere in the lines. Not limited to start or end of the lines.

Expected output:

>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>

Solution

You might be interested in BioAwk, it is an adapted version of awk which is tuned to process fasta files

bioawk -c fastx -v seq1="pattern1" -v seq2="pattern2" \
       '($seq ~ seq1) && ($seq ~ seq2) { print ">"$name; print $seq }' file.fasta

If you want seq1 at the beginning and seq2 at the end, you can change it into:

bioawk -c fastx -v seq1="pattern1" -v seq2="pattern2" \
       '($seq ~ "^"seq1) && ($seq ~ seq2"$") { print ">"$name; print $seq }' file.fasta

This is really practical for processing fasta files, as often the sequence is spread over multiple lines. The above code handles this very easily as the variable $seq contains the full sequence.

If you do not want to install BioAwk, you can use the following method to process your FASTA file. It will allow multi-line sequences and does the following:

read a single record at a time (this assumes no > in the header, except the first character)
extract the header from the record and store it in name (not really needed)
merge the full sequence in a single string of characters, removing all newlines and spaces. This ensures that searching for pattern1 or pattern2 will not fail if the pattern is split over multiple lines.
if a match is found, print the record.

The following awk does the requested:

awk -v seq1="pattern1" -v seq2="pattern2" \
    'BEGIN{RS=">"; ORS=""; FS="\n"}
     { seq="";for(i=2;i<=NF;++i) seq=seq""$i; gsub(/[^a-zA-Z0-9]/,"",seq) }
     (seq ~ seq1 && seq ~ seq2){print ">" $0}' file.fasta

If the record header contains other > characters which are not at the beginning of the line, you have to take a slightly different approach (unless you use GNU awk)

awk -v seq1="pattern1" -v seq2="pattern2" \
    '/^>/ && (seq ~ seq1 && seq ~ seq2) {
         print name
         for(i=0;i<n;i++) print aseq[i]
     }
     /^>/ { seq=""; delete aseq; n=0; name=$0; next }
     { aseq[n++] = $0; seq=seq""$0; sub(/[^a-zA-Z0-9]*$/,"",seq) }
     END { if (seq ~ seq1 && seq ~ seq2) {
              print name
              for(i=0;i<n;i++) print aseq[i]
            }
     }' file.fasta

note: we make use of sub here in case unexpected characters are introduced in the fasta file (eg. spaces/tabs or CR (\r))

Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X) . I'm not sure if this version is compatible with POSIX.