Tags: regex, unix, unicode, grep

grep with regex wrongly selects unicode characters


I ran grep with the following regex:

grep -e "^[a-zA-Z]" file.txt

The point is to get only lines that start with an alphabetic character in the ASCII range. That works if I explicitly type out the alphabet, like this:

grep -e "^[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]" file.txt

That is already odd, because that is exactly what [a-zA-Z] is supposed to specify. When I look at what the first regex matches in my input data, I get matches like:

🅱

Note that the ligatures ﬁ (U+FB01) and ﬂ (U+FB02) are each a single character in these cases.

Technically, typing out the alphabet explicitly is a solution, but I'd rather:

  • know why [a-zA-Z] doesn't work
  • see what a sensible solution would look like, if one exists.

Solution

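  • grep is locale-aware. A range such as a-z is interpreted using the locale's collating sequence and character set, so depending on your locale [a-zA-Z] can match non-ASCII characters (e.g. á, ä, ø, æ). To force ASCII matching (at the cost of not handling any multibyte characters), set the C locale: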

    LC_ALL=C grep -e '^[a-zA-Z]' file.txt
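
As a minimal sketch of the effect, the snippet below builds a small, hypothetical sample file (its contents are an assumption, not the asker's data) and runs the command from above against it. The non-ASCII lines are written with octal escapes so the bytes are unambiguous: `\303\244` is UTF-8 for ä and `\357\254\201` is UTF-8 for the ﬁ ligature.

```shell
# Hypothetical sample data: two ASCII-initial lines, one starting with an
# accented letter, one starting with the fi ligature, one starting with a digit.
tmpfile=$(mktemp)
printf 'apple\nZebra\n\303\244rger\n\357\254\201sh\n1abc\n' > "$tmpfile"

# In the C locale grep works on single bytes, so [a-zA-Z] can only match
# the 52 ASCII letters; the lead bytes of multibyte characters never match.
ascii_only=$(LC_ALL=C grep -e '^[a-zA-Z]' "$tmpfile")
printf '%s\n' "$ascii_only"

rm -f "$tmpfile"
```

With this input only `apple` and `Zebra` should be printed. Running `locale` beforehand shows which locale settings grep would otherwise pick up from the environment.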