Search code examples
linuxawkgrepfasta

Extract lines containing two patterns


I have a file which contains several lines as follows:

>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
>header3
<pattern_1>ATGGCCACCAACAACCAGAGCTCCC
>header4
GACCGGCACGTACAACCTCCAGGAAATCGTGCCCGGCAGCGTGTGGATGGAGAGGGACGTG
>header5
TGCCCCCACGACCGGCACGTACAAC<pattern_2>

I want to extract all lines containing both and including the header lines.

I have tried using grep, but it only extracts the sequence lines but not the header lines.

grep <pattern_1> | grep <pattern_2> input.fasta > output.fasta

How to extract lines containing both the patterns and the headers in Linux? The patterns can be present anywhere in the lines. Not limited to start or end of the lines.

Expected output:

>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>

Solution

  • You might be interested in BioAwk, it is an adapted version of awk which is tuned to process fasta files

    bioawk -c fastx -v seq1="pattern1" -v seq2="pattern2" \
           '($seq ~ seq1) && ($seq ~ seq2) { print ">"$name; print $seq }' file.fasta
    

    If you want seq1 at the beginning and seq2 at the end, you can change it into:

    bioawk -c fastx -v seq1="pattern1" -v seq2="pattern2" \
           '($seq ~ "^"seq1) && ($seq ~ seq2"$") { print ">"$name; print $seq }' file.fasta
    

    This is really practical for processing fasta files, as often the sequence is spread over multiple lines. The above code handles this very easily as the variable $seq contains the full sequence.

    If you do not want to install BioAwk, you can use the following method to process your FASTA file. It will allow multi-line sequences and does the following:

    • read a single record at a time (this assumes no > in the header, except the first character)
    • extract the header from the record and store it in name (not really needed)
    • merge the full sequence in a single string of characters, removing all newlines and spaces. This ensures that searching for pattern1 or pattern2 will not fail if the pattern is split over multiple lines.
    • if a match is found, print the record.

    The following awk does the requested:

    awk -v seq1="pattern1" -v seq2="pattern2" \
        'BEGIN{RS=">"; ORS=""; FS="\n"}
         { seq="";for(i=2;i<=NF;++i) seq=seq""$i; gsub(/[^a-zA-Z0-9]/,"",seq) }
         (seq ~ seq1 && seq ~ seq2){print ">" $0}' file.fasta
    

    If the record header contains other > characters which are not at the beginning of the line, you have to take a slightly different approach (unless you use GNU awk)

    awk -v seq1="pattern1" -v seq2="pattern2" \
        '/^>/ && (seq ~ seq1 && seq ~ seq2) {
             print name
             for(i=0;i<n;i++) print aseq[i]
         }
         /^>/ { seq=""; delete aseq; n=0; name=$0; next }
         { aseq[n++] = $0; seq=seq""$0; sub(/[^a-zA-Z0-9]*$/,"",seq) }
         END { if (seq ~ seq1 && seq ~ seq2) {
                  print name
                  for(i=0;i<n;i++) print aseq[i]
                }
         }' file.fasta
    

    note: we make use of sub here in case unexpected characters are introduced in the fasta file (eg. spaces/tabs or CR (\r))


    Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X) . I'm not sure if this version is compatible with POSIX.