regex bash shell cygwin glob

Weird behavior of BASH glob/regex ranges


I'm seeing BASH bracket ranges (e.g. [A-Z]) behave in an unexpected way.
Is there an explanation for this behavior, or is it a bug?

Let's say I have a variable, from which I want to strip all uppercase letters:

$ var='ABCDabcd0123'
$ echo "${var//[A-Z]/}"

The result I get is this:

a0123

If I do it with sed, I get the expected result:

$ echo "${var}" | sed 's/[A-Z]//g'
abcd0123

The same seems to be the case for BASH built-in regex match:

$ [[ a =~ [A-Z] ]] ; echo $?
1
$ [[ b =~ [A-Z] ]] ; echo $?
0

If I check all lowercase letters from 'a' to 'z', it seems that only 'a' is an exception:

$ for l in {a..z}; do [[ $l =~ [A-Z] ]] || echo $l; done
a

I do not have case-insensitive matching enabled, and even if I did, it should not make the letter 'a' behave differently:

$ shopt -p nocasematch
shopt -u nocasematch

For reference, I'm using Cygwin, and I don't see this behavior on any other machine:

$ uname
CYGWIN_NT-6.3
$ bash --version | head -1
GNU bash, version 4.3.46(7)-release (x86_64-unknown-cygwin)
$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_ALL=

EDIT:

I've found the exact same issue reported here: https://bugs.launchpad.net/ubuntu/+source/bash/+bug/120687
So, I guess it's a bug(?) in the "en_GB.UTF-8" collation, not in BASH itself.
Setting LC_COLLATE=C indeed solves this.


Solution

  • It certainly has to do with your locale setting. An excerpt from the GNU bash man page, under Pattern Matching:

    [..] in the default C locale, [a-dx-z] is equivalent to [abcdxyz]. Many locales sort characters in dictionary order, and in these locales [a-dx-z] is typically not equivalent to [abcdxyz]; it might be equivalent to [aBbCcDdxXyYz], for example. To obtain the traditional interpretation of ranges in bracket expressions, you can force the use of the C locale by setting the LC_COLLATE or LC_ALL environment variable to the value C, or enable the globasciiranges shell option.[..]
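
To see why only 'a' escapes [A-Z] in the question, it may help to simulate a dictionary-style order by hand. The sketch below does not read any real locale. It assumes an interleaved order aAbBcC...zZ (lower case first), which is consistent with the behavior reported in the question but is only an assumption about how en_GB.UTF-8 collates on that system. The variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# Assumed collation order (hypothetical, not taken from a real locale):
# aAbBcC...zZ
lower='abcdefghijklmnopqrstuvwxyz'
upper='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
order=''
for ((i = 0; i < 26; i++)); do
    order+="${lower:i:1}${upper:i:1}"
done

# Everything that sorts from 'A' through 'Z' in that order.
rest=${order#*A}          # drop the leading "aA", leaving bBcC...zZ
range="A${rest%%Z*}Z"     # AbBcC...zZ

# Lower-case letters that fall outside the simulated [A-Z] range.
missing=''
for ((i = 0; i < 26; i++)); do
    l=${lower:i:1}
    [[ $range == *"$l"* ]] || missing+="$l"
done
echo "outside [A-Z]: $missing"   # prints: outside [A-Z]: a
```

Under that assumed order, every lower-case letter except 'a' sorts between 'A' and 'Z', which matches the a0123 result in the question.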

    Use the POSIX character classes ([[:upper:]] in this case), or change your locale setting LC_ALL or LC_COLLATE to C as mentioned above.

    LC_ALL=C
    var='ABCDabcd0123'
    echo "${var//[A-Z]/}"
    abcd0123
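
The globasciiranges option mentioned in the man page excerpt is a third route that leaves the locale variables alone. A minimal sketch, assuming bash 4.3 or later (the version in the question qualifies):

```shell
#!/usr/bin/env bash
# Requires bash 4.3 or later, where the globasciiranges option exists.
shopt -s globasciiranges

var='ABCDabcd0123'
stripped=${var//[A-Z]/}   # ranges in patterns now compare as in the C locale
echo "$stripped"          # prints: abcd0123
```

As far as I know, this option only covers bash's own pattern matching (globs, case, parameter expansion). The =~ operator goes through the system regex library, so [[ a =~ [A-Z] ]] may still follow the locale's collation.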
    

    Also, with LC_ALL=C set, the [A-Z] test in your loop fails for every lower-case letter, so the loop prints all of them, a through z:

    LC_ALL=C; for l in {a..z}; do [[ $l =~ [A-Z] ]] || echo $l; done
    

    Also, under the above locale setting, neither letter matches [A-Z]:

    [[ a =~ [A-Z] ]] ; echo $?
    1
    [[ b =~ [A-Z] ]] ; echo $?
    1
    

    but the match succeeds against the lower-case range [a-z]:

    [[ a =~ [a-z] ]] ; echo $?
    0
    [[ b =~ [a-z] ]] ; echo $?
    0
    

    That said, all of this can be avoided by using the POSIX character classes, which work in a new shell without any locale override:

    echo "${var//[[:upper:]]/}"
    abcd0123
    

    and

    for l in {a..z}; do [[ $l =~ [[:upper:]] ]] || echo $l; done
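
If a literal range like [A-Z] is still preferred somewhere, note that the LC_ALL=C assignments shown earlier stay in effect for the rest of the session. One way to confine the override to a single operation is a function whose body runs in a subshell. This is only a sketch: strip_upper is a hypothetical helper name, not something bash provides.

```shell
#!/usr/bin/env bash
# strip_upper is a hypothetical helper name, not part of bash.
# The ( ... ) body runs in a subshell, so the LC_ALL assignment
# does not leak into the calling shell.
strip_upper() (
    LC_ALL=C
    printf '%s\n' "${1//[A-Z]/}"
)

result=$(strip_upper 'ABCDabcd0123')
echo "$result"   # prints: abcd0123
```

The caller's locale is untouched afterwards, so sorting, messages, and other locale-dependent behavior keep following en_GB.UTF-8.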