Tags: regex, gnu-findutils

A bug in findutils regarding case sensitivity?


There is a file named foo.js in the current folder.

I use find to search for it:

tigerlei::~/work $ ll foo.js
-rw-rw-r-- 1 tigerlei tigerlei 187 Mar 29  2017 foo.js

tigerlei::~/work $ find . -regex '.*/foo.*.j[R-T]+' -regextype egrep
./foo.js

tigerlei::~/work $ find . -regex '.*/foo.*.j[RST]+' -regextype egrep

tigerlei::~/work $ find . -iregex '.*/foo.*.j[RST]+' -regextype egrep
./foo.js

My system is Ubuntu 14.04.

The findutils version is 4.4.2.

With -regex, find is supposed to match case-sensitively. But:

  • [R-T] will match the lowercase letter 's', and
  • [RST] will not match 's'.

Question

Why do my searches produce these results?


Solution

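  • You need to set LC_ALL=C so that the range in the bracket expression follows the order of the ASCII table. In many other locales (the question's system presumably uses one such as en_US.UTF-8), the collation order interleaves lowercase and uppercase letters, so a range like [R-T] can also cover the lowercase 's', while an explicit list like [RST] matches only the characters listed.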

    See this thread:

    If you mean to match a letter in the user's language, use grep '[[:alpha:]]' and don't modify LC_ALL. But if you want to match the a-zA-Z ASCII characters, you need either LC_ALL=C grep '[[:alpha:]]' or LC_ALL=C grep '[a-zA-Z]'. [a-z] matches the characters that sort after a and before z (though with many APIs it's more complicated than that). In other locales, you generally don't know what those are. For instance some locales ignore case for sorting so [a-z] in some APIs like bash patterns, could include [B-Z] or [A-Y]. In many UTF-8 locales (including en_US.UTF-8 on most systems), [a-z] will include the latin letters from a to y with diacritics but not those of z (since z sorts before them)...
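As a minimal sketch of the fix, the script below recreates the question's foo.js in a scratch directory and reruns the three searches with LC_ALL=C. It assumes GNU find with its default regex syntax; the scratch directory and the variable names are made up for illustration, and the dot before "j" is escaped here so that it matches a literal dot, which is a small departure from the question's pattern.

```shell
#!/bin/sh
# Recreate the question's setup in a throwaway directory (illustrative only).
workdir=$(mktemp -d)
touch "$workdir/foo.js"
cd "$workdir" || exit 1

# In the C locale the range [R-T] covers only the ASCII characters R, S and T,
# so the lowercase "s" in foo.js is no longer matched.
range_c=$(LC_ALL=C find . -regex '.*/foo.*\.j[R-T]+')

# The explicit list [RST] never matched the lowercase "s" in the first place.
list_c=$(LC_ALL=C find . -regex '.*/foo.*\.j[RST]+')

# Case-insensitive matching has to be requested explicitly with -iregex.
iregex_c=$(LC_ALL=C find . -iregex '.*/foo.*\.j[RST]+')

printf 'range:  [%s]\nlist:   [%s]\niregex: [%s]\n' "$range_c" "$list_c" "$iregex_c"
```

With LC_ALL=C, the [R-T] search should behave like the [RST] search and find nothing, and only the -iregex search should report ./foo.js. Without LC_ALL=C, the first search depends on the locale's collation order, which is what the question ran into.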