Tags: unix, multiline, dna-sequence, pcregrep

Pcregrep duplicating matching multi-line patterns?


I have a text file that contains two lines of a sample DNA sequence. Using pcregrep, I want to find occurrences of "CCC", especially those that span multiple lines (see the end of line 1 and the beginning of line 2 in test.txt below).

test.txt:

AGAGUGGCAAUAUGCGUAUAACGAUUAUUCUGGUCGCACCCGCCAGAGCAGAAAAUAUUGGGGCAGCGCC
CAUGCUGGGUCGCACAUGGAUCUGGUGAUAUUAUUGAUAAUAUUAAAGUUUUCCCGACAUUGGCUGAAUA

Using Command:

pcregrep -M --color "C[\n]?C[\n]?C" test.txt

Returns:

AGAGUGGCAAUAUGCGUAUAACGAUUAUUCUGGUCGCA**CCC**GCCAGAGCAGAAAAUAUUGGGGCAGCG**CC**

**C**CAUGCUGGGUCGCACAUGGAUCUGGUGAUAUUAUUGAUAAUAUUAAAGUUUU**CCC**GACAUUGGCUGAAUA

It seems to correctly highlight the two Cs at the end of line 1. However, it then highlights the first C of line 2 and prints the second line in its entirety, which gives me what looks like a duplicated C.

What am I doing wrong here and how can I avoid the duplication of 'C' in line 2?


Solution

  • Try this:

    pcregrep -M --color "(?<!C)(C\RCC|CC\RC)(?!C)" test.txt
    

    I'm assuming that you want to match exactly three Cs and no more, and that runs of more than three Cs can occur. If such runs cannot occur, or you don't mind matching inside a longer run, you can use this simpler regex instead:

    pcregrep -M --color "C\RCC|CC\RC" test.txt
    

    Explanation:

    (?<!C)       # Negative lookbehind: don't match if there's a C before the match
    (            # One of these:
        C\RCC    #   C + any kind of newline + CC
      | CC\RC    #   CC + any kind of newline + C
    )
    (?!C)        # Negative lookahead: don't match if there's a C after the match
    

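If pcregrep is not at hand, the two patterns can be sanity-checked with a small Python sketch against the sample from the question. This is only an illustration, not part of the pcregrep solution: Python's `re` module has no `\R` escape, so the hypothetical `NEWLINE` fragment below spells out the common line endings instead, and the names `STRICT`, `SIMPLE`, `SAMPLE` and `spanning_matches` are made up for this example.

```python
import re

# Assumption: Python's `re` does not support PCRE's \R, so the usual
# line endings are listed explicitly as a stand-in.
NEWLINE = r"(?:\r\n|\n|\r)"

# Exactly three Cs split by one line break, with no C on either side.
STRICT = re.compile(rf"(?<!C)(?:C{NEWLINE}CC|CC{NEWLINE}C)(?!C)")

# Simpler variant: three Cs split by one line break, neighbours ignored.
SIMPLE = re.compile(rf"C{NEWLINE}CC|CC{NEWLINE}C")

# The two lines of test.txt from the question.
SAMPLE = (
    "AGAGUGGCAAUAUGCGUAUAACGAUUAUUCUGGUCGCACCCGCCAGAGCAGAAAAUAUUGGGGCAGCGCC\n"
    "CAUGCUGGGUCGCACAUGGAUCUGGUGAUAUUAUUGAUAAUAUUAAAGUUUUCCCGACAUUGGCUGAAUA\n"
)


def spanning_matches(pattern, text):
    """Return (start, end, matched_text) for every match of pattern in text."""
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


if __name__ == "__main__":
    # One match is expected: the "CC" ending line 1 plus the "C" starting line 2.
    for start, end, matched in spanning_matches(STRICT, SAMPLE):
        print(start, end, repr(matched))
```

Both patterns require a line break inside the match, so a CCC that sits entirely on one line (such as the ones in the middle of each sample line) is not reported by either of them. The strict variant additionally rejects a run such as `CC`, line break, `CC`, which the simpler variant would still match.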