Was trying to get words with consecutive repeated letters occurring twice or thrice. Not able find a way to use quantifier and capture group using ERE
$ grep --version | head -n1
grep (GNU grep) 2.25
$ # consecutive repeated letters occurring twice
$ grep -m5 -xiE '[a-z]*([a-z])\1[a-z]*[a-z]*([a-z])\2[a-z]*' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed
$ # no output for this, why?
$ grep -m5 -xiE '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words
Works with -P
though
$ grep -m5 -xiP '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed
$ grep -m5 -xiP '([a-z]*([a-z])\2[a-z]*){3}' /usr/share/dict/words
Chattahoochee
McConnell
Mississippi
Mississippian
Mississippians
Thanks Casimir et Hippolyte for coming up with simpler input and regex to test this behavior
$ echo 'aazbb' | grep -E '(([a-z])\2[a-z]*){2}' || echo 'No match'
aazbb
$ echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]*){2}([a-z])\3[a-z]*' || echo 'No match'
aazbbycc
$ echo 'aazbbycc' | grep -P '(([a-z])\2[a-z]*){3}' || echo 'No match'
aazbbycc
$ # failing case
$ echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]*){3}' || echo 'No match'
No match
Same behavior seen with sed
as well
$ sed --version | head -n1
sed (GNU sed) 4.2.2
$ echo 'aazbb' | sed -E '/(([a-z])\2[a-z]*){2}/! s/.*/No match/'
aazbb
$ echo 'aazbbycc' | sed -E '/(([a-z])\2[a-z]*){2}([a-z])\3[a-z]*/! s/.*/No match/'
aazbbycc
$ # failing case
$ echo 'aazbbycc' | sed -E '/(([a-z])\2[a-z]*){3}/! s/.*/No match/'
No match
Related search links, I checked some of them, but didn't get anything close to this question
If this is solved in newer version of grep
or sed
, let me know. Also, if the issue is seen in non-GNU implementations
I filed an issue https://debbugs.gnu.org/cgi/bugreport.cgi?bug=26864 and the manual is now updated to reflect such issues.
From https://www.gnu.org/software/grep/manual/grep.html#Known-Bugs:
Back-references can greatly slow down matching, as they can generate exponentially many matching possibilities that can consume both time and memory to explore. Also, the POSIX specification for back-references is at times unclear. Furthermore, many regular expression implementations have back-reference bugs that can cause programs to return incorrect answers or even crash, and fixing these bugs has often been low-priority: for example, as of 2020 the GNU C library bug database contained back-reference bugs 52, 10844, 11053, 24269 and 25322, with little sign of forthcoming fixes. Luckily, back-references are rarely useful and it should be little trouble to avoid them in practical applications.