Search code examples
regexbashunixgrep

Searching palindromes with grep -E (egrep)


UPD: according to Ed Morton's answer the issue isn't related to -E option only.

Reading The Unix Programming Environment by Brian W. Kernighan and Rob Pike.

Working on an exercise, where one should find all palindromes in a wordlist using grep -E.

The most concise solution I've been able to find online and understand would be

'^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$', which should find all palindromes up to 11 characters.

For some reason though, grep in my terminal gives me results that I can't explain:

me@machine:~/test$ cat sample
a
ab
abba
abcdef
abcba
zufolo
me@machine:~/test$ grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$' sample
a
abba
abcba
zufolo
me@machine:~/test$ grep -E '^(.?)(.?)(.?)(.?).?\4\3\2\1$' sample
a
abba
abcba

Why would it add zufolo to the results (while not coloring any matches as it does with the actual palindromes)? Why does zufolo disappear when I reduce the regex to up to 9 characters? output

Does it have anything to do with my setup (the most standard one)?

me@machine:~/test$ ls -l /proc/$$/exe
lrwxrwxrwx 1 me me 0 jan 15 18:37 /proc/3528/exe -> /usr/bin/bash
me@machine:~/test$ cat /proc/version
Linux version 6.5.0-14-generic (buildd@lcy02-amd64-110) (x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #14~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Nov 20 18:15:30 UTC 2

P.S. zufolo is from wamerican-insane where I'm actually testing it. There are plenty of others like xenian, tartar.

I tried researching online, modifying regex, but can't get my head around this one.


Solution

  • Strangely enough, removing the $ from the end of the regexp (i.e. making it less restrictive) produces fewer matches, which is the opposite of what it should do:

    a) With the $ at the end of the regexp:

    $ grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$' sample
    a
    abba
    abcba
    zufolo
    

    b) Without the $ at the end of the regexp:

    $ grep -E '^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1' sample
    a
    abba
    abcba
    

    It's not just GNU grep that behaves strangely, GNU sed has the same behavior from the question when just matching with sed -nE '/.../p' sample as GNU grep does AND sed behaves differently if we're just doing a match vs if we're doing a match + replace.

    For example here's sed doing a match+replacement and behaving the same way as grep above:

    a) With the $ at the end of the regexp:

    $ sed -nE 's/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$/&/p' sample
    a
    abba
    abcba
    zufolo
    

    b) Without the $ at the end of the regexp:

    $ sed -nE 's/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1/&/p' sample
    a
    abba
    abcba
    

    but here's sed just doing a match and behaving differently from any of the above:

    a) With the $ at the end of the regexp (note the extra ab in the output):

    $ sed -nE '/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$/p' sample
    a
    ab
    abba
    abcba
    zufolo
    

    b) Without the $ at the end of the regexp (note the extra ab and abcdef in the output):

    $ sed -nE '/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1/p' sample
    a
    ab
    abba
    abcdef
    abcba
    zufolo
    

    Also interestingly this:

    $ sed -nE 's/^(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1$/<&>/p' sample
    

    outputs:

    <a>
    <abba>
    <abcba>
    <>zufolo
    

    the last line of which means the regexp is apparently matching the start of the line and ignoring the $ end-of-string metachar present in the regexp!

    The odd behavior isn't just associated with using -E, though, if I remove -E and just use POSIX compliant BREs then:

    a) With the $ at the end of the regexp:

    $ grep '^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1$' sample
    a
    abba
    abcba
    zufolo
    

    $ sed -n 's/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1$/&/p' sample
    a
    abba
    abcba
    zufolo
    

    b) Without the $ at the end of the regexp:

    $ grep '^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1' sample
    a
    abba
    abcba
    

    $ sed -n 's/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1/&/p' sample
    a
    abba
    abcba
    

    and again just doing a match in sed below behaves differently from the sed match+replacements above:

    a) With the $ at the end of the regexp:

    $ sed -n '/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1$/p' sample
    a
    ab
    abba
    abcba
    zufolo
    

    b) Without the $ at the end of the regexp:

    $ sed -n '/^\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\)\(.\{0,1\}\).\{0,1\}\5\4\3\2\1/p' sample
    a
    ab
    abba
    abcdef
    abcba
    zufolo
    

    The above shows that, given the same regexp, sed is apparently matching different strings depending on whether it's doing a substitution or not.

    So, there appear to be bugs in GNU grep and GNU sed without the -E. Or maybe the above is all just undefined behavior given those regexps and so any grep or sed can do whatever they like with them.

    $ grep --version | head -1
    grep (GNU grep) 3.11
    
    $ sed --version | head -1
    sed (GNU sed) 4.9