I have a (perhaps complicated) RegExp question. A tool that generates files from models says my model uses a name twice, but does not say which name this is. I know that all the names in question start with "CK_", followed by some non-whitespace. I prepared this test-file:
CK_123abc
foo
CK_abc
CK_123abc
CK_199
bar
CK_177
bar
CK_188
As you can see, "CK_123abc" appears twice. I want to catch all such duplicates (there may be more) with a RegExp. This is what I have so far: (CK_\S*).+\1
This works fine and matches the following text:
CK_123abc
foo
CK_abc
CK_123abc
but it also matches
CK_199
bar
CK_177
bar
CK_1
The 2nd, unwanted match is for CK_1. As my real document is full of these "half-string" matches, I can't find my real match (like the 1st one here) in the data. I think that (CK_\S*) is for some reason not greedy, or that the whole regex is greedy. For my use case to work, (CK_\S*) has to match as much as possible first, and then that same text should be found later in the document.
I'm using Notepad++ (with PCRE). "." matches "\r" and "\n".
Any pointers are highly appreciated.
The problem is not the greediness or laziness of the quantifiers, but the way the regex engine works. When a pattern fails, the engine uses backtracking to try other possibilities from the same starting position in the string, until the pattern succeeds or there are no more possibilities. Here, \S* first takes all of "CK_199"; since no second "CK_199" follows, it gives back characters one at a time until the capture is "CK_1", which does occur again later (in "CK_177" and "CK_188").
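To make the backtracking visible, here is a minimal sketch in Python (an assumption on my part: the question uses Notepad++, and Python's re module is only a stand-in that backtracks the same way for this pattern). The text is the last five lines of the test file, and re.DOTALL plays the role of the dot matching line breaks.

```python
import re

# Last five lines of the test file from the question.
text = "CK_199\nbar\nCK_177\nbar\nCK_188\n"

# re.DOTALL lets "." match newlines, like the setting used in the question.
match = re.search(r"(CK_\S*).+\1", text, re.DOTALL)

# The capture group was shortened by backtracking until a repeat was found.
print(repr(match.group(1)))  # 'CK_1'
print(repr(match.group(0)))  # 'CK_199\nbar\nCK_177\nbar\nCK_1'
```

The group starts out as "CK_199" and only ends up as "CK_1" because every longer candidate fails to reappear later in the text.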
The only way to restrict this behaviour is to add more constraints to your pattern (to limit the possibilities), as you can see in several answers here.
The idea is to check the limits (left and right) of the names using spaces, word boundaries or possessive quantifiers, without forgetting to do the same for the backreference:
with spaces: (?:\s|^)(CK_\S*)(?=\s.*(?<=\s)\1(?:\s|$))
(a bit long, but probably the most watertight way)
with word boundaries: \b(CK_\S*)\b(?=.*\b\1\b)
with possessive quantifiers and word boundaries: \b(CK_\S*+)(?=.*\b\1\b)
Note: since the dot has to match across lines, you need to switch on singleline mode (the ". matches newline" option in Notepad++) for all the patterns.
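As a quick check outside Notepad++, the following sketch runs the original pattern and two of the constrained patterns over the test file from the question. It uses Python's re module as a stand-in engine (an assumption, not what the question uses), with re.DOTALL as the singleline mode. The helper name duplicated_names is made up for this example, and the possessive variant is left out because support for possessive quantifiers depends on the engine and its version.

```python
import re

# The test file from the question.
text = (
    "CK_123abc\nfoo\nCK_abc\nCK_123abc\n"
    "CK_199\nbar\nCK_177\nbar\nCK_188\n"
)

original = r"(CK_\S*).+\1"
with_spaces = r"(?:\s|^)(CK_\S*)(?=\s.*(?<=\s)\1(?:\s|$))"
with_boundaries = r"\b(CK_\S*)\b(?=.*\b\1\b)"

def duplicated_names(pattern):
    """Hypothetical helper: return what group 1 captured for every match."""
    # re.DOTALL is the singleline mode: "." also matches line breaks.
    return re.findall(pattern, text, re.DOTALL)

print(duplicated_names(original))         # ['CK_123abc', 'CK_1']
print(duplicated_names(with_spaces))      # ['CK_123abc']
print(duplicated_names(with_boundaries))  # ['CK_123abc']
```

With the limits checked on both the name and the backreference, the engine can no longer shorten the capture to "CK_1", so only the real duplicate is reported.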