
Using grep to search DNA sequence files


I am trying to use Unix's grep to search for specific sequences within files. The files are usually very large (~1 GB) and consist of 'A's, 'T's, 'C's, and 'G's. They also span many lines, each about 60 characters long. The problem is that when I search for a specific sequence, grep returns matches that occur on a single line, but not matches that span a line break (that is, have a line break somewhere in the middle). For example:

Using

$ grep -i -n "GACGGCT" grep3.txt 

to search the file grep3.txt (the target 'GACGGCT's are marked with double asterisks):

GGGCTTCGA**GACGGCT**GACGGCTGCCGTGGAGTCT
CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTG**GA
CGGCT**AGGTGAGAGCCAGCTCCAAGGCCTCTGGGC
CACCAGGCCAGCTCAGGCCACCCCTTCCCCAGTCA
CCCCCCAAGAGGTGCCCCAGACAGAGCAGGGGCCA
GGCGCCCTGAGGC**GACGGCT**CTCAGCCTCCGCCCC

returns

1:GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT
6:GGCGCCCTGAGGCGACGGCTCTCAGCCTCCGCCCC

So, my problem here is that grep does not find the GACGGCT that spans the end of line 2 and the beginning of line 3.

How can I use grep to find target sequences that may or may not include a line break at any point in the string? Or how can I tell grep to ignore line breaks in the target string? Is there a simple way to do this?


Solution

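  • Use pcregrep with -M (multiline matching) and allow an optional newline between each pair of bases:

    $ pcregrep -nM "G[\n]?A[\n]?C[\n]?G[\n]?G[\n]?C[\n]?T" grep3.txt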
    1:GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT
    2:CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTGGA
    CGGCTAGGTGAGAGCCAGCTCCAAGGCCTCTGGGC
    6:GGCGCCCTGAGGCGACGGCTCTCAGCCTCCGCCCC
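
Typing `[\n]?` between every base by hand gets tedious for longer targets. As a minimal sketch, the pattern can be generated from the plain sequence with sed. The function name `seq_to_pattern` is hypothetical (not part of grep or pcregrep), and the sketch assumes the sequence contains only letters such as A, C, G, and T, so nothing in it needs regex escaping.

```shell
# Hypothetical helper: turn a plain sequence into a pattern that
# tolerates a line break between any two adjacent characters.
# Assumes the input holds only letters (no regex metacharacters).
seq_to_pattern() {
    # First expression: append "[\n]?" after every character.
    # Second expression: drop the trailing "[\n]?" after the last one.
    printf '%s' "$1" | sed 's/./&[\\n]?/g; s/\[\\n]?$//'
}

seq_to_pattern GACGGCT
```

The printed pattern could then be passed to the command from the answer, for example `pcregrep -nM "$(seq_to_pattern GACGGCT)" grep3.txt`.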