I have a text file containing two lines of a sample DNA sequence. Using pcregrep, I want to find patterns matching "CCC", especially those that span multiple lines (see the end of line 1 and the beginning of line 2 in test.txt below).
test.txt:
AGAGUGGCAAUAUGCGUAUAACGAUUAUUCUGGUCGCACCCGCCAGAGCAGAAAAUAUUGGGGCAGCGCC
CAUGCUGGGUCGCACAUGGAUCUGGUGAUAUUAUUGAUAAUAUUAAAGUUUUCCCGACAUUGGCUGAAUA
Using Command:
pcregrep -M --color "C[\n]?C[\n]?C" test.txt
Returns:
AGAGUGGCAAUAUGCGUAUAACGAUUAUUCUGGUCGCA**CCC**GCCAGAGCAGAAAAUAUUGGGGCAGCG**CC**
**C**CAUGCUGGGUCGCACAUGGAUCUGGUGAUAUUAUUGAUAAUAUUAAAGUUUU**CCC**GACAUUGGCUGAAUA
It seems to correctly highlight the two C's at the end of line 1; however, it then highlights the first C of line 2 and proceeds to print the second line in its entirety, giving me a duplicated C.
What am I doing wrong here and how can I avoid the duplication of 'C' in line 2?
Try with this:
pcregrep -M --color "(?<!C)(C\RCC|CC\RC)(?!C)" test.txt
I'm assuming that you want to find exactly 3 C's and no more, and that a run of more than 3 C's is possible. If that is not possible, or you don't care about matching more than 3 C's, you may use this simpler regex instead:
pcregrep -M --color "C\RCC|CC\RC" test.txt
Explanation:
(?<!C) # Negative lookbehind: Don't match if there's a C before the match
( # One of these:
C\RCC # C + any kind of new line + CC
| CC\RC # CC + any kind of new line + C
)
(?!C) # Negative lookahead: Don't match if there's a C after the match
See demo here.
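To see how the strict and the simpler pattern differ, here is a minimal Python sketch. It is only an illustration, not pcregrep itself: Python's standard `re` module has no `\R` escape, so the hypothetical `NEWLINE` fragment below stands in for it and covers only `\r\n`, `\n` and `\r`. The helper name `spanning_matches` and the four-C test string are likewise made up for this example.

```python
import re

# Assumption: stand-in for PCRE's \R, which Python's re module does not provide.
NEWLINE = r"(?:\r\n|\n|\r)"

# Strict pattern: three C's split by one line break, with no C on either side.
STRICT = re.compile(rf"(?<!C)(?:C{NEWLINE}CC|CC{NEWLINE}C)(?!C)")

# Simpler pattern: same alternation, but neighbouring characters are not checked.
SIMPLE = re.compile(rf"C{NEWLINE}CC|CC{NEWLINE}C")


def spanning_matches(pattern, text):
    """Return every substring of text matched by pattern (hypothetical helper)."""
    return [m.group(0) for m in pattern.finditer(text)]


# The two lines of test.txt from the question, joined by a newline.
sample = (
    "AGAGUGGCAAUAUGCGUAUAACGAUUAUUCUGGUCGCACCCGCCAGAGCAGAAAAUAUUGGGGCAGCGCC\n"
    "CAUGCUGGGUCGCACAUGGAUCUGGUGAUAUUAUUGAUAAUAUUAAAGUUUUCCCGACAUUGGCUGAAUA"
)

# Made-up input with four C's across the line break.
four_c = "ACC\nCCA"

if __name__ == "__main__":
    print(spanning_matches(STRICT, sample))
    print(spanning_matches(SIMPLE, sample))
    print(spanning_matches(STRICT, four_c))
    print(spanning_matches(SIMPLE, four_c))
```

On the sample from the question both patterns find the single run that crosses the line break. On the four-C input only the simpler pattern reports a match, because the lookbehind and lookahead in the strict one reject any run that has another C next to it.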