Tags: regex, unicode, grep, ascii, non-ascii-characters

(grep) Regex to match non-ASCII characters?


On Linux, I have a directory with lots of files. Some of the filenames contain non-ASCII characters, but they are all valid UTF-8. One program has a bug that prevents it from working with non-ASCII filenames, and I need to find out how many files are affected. I was going to run find, pipe its output through grep to print the names containing non-ASCII characters, and then use wc -l to count them. It doesn't have to be grep; I can use any standard Unix tool that supports regular expressions, such as Perl, sed, or AWK.

However, is there a regular expression for 'any character that's not an ASCII character'?


Solution

  • This will match a single non-ASCII character:

    [^\x00-\x7F]
    

    This is a valid PCRE (Perl-Compatible Regular Expression).
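With GNU grep, PCRE syntax has to be enabled with the -P option. The following is a minimal sketch, assuming GNU grep built with PCRE support; the function name list_non_ascii_lines is hypothetical and only for illustration.

```shell
# list_non_ascii_lines (hypothetical helper): reads lines on stdin and prints
# those containing at least one byte outside the ASCII range 0x00-0x7F.
# Assumes GNU grep with -P (PCRE) support.
# LC_ALL=C makes grep treat the input as raw bytes, so every byte of a
# multi-byte UTF-8 character falls outside 0x00-0x7F and matches.
list_non_ascii_lines() {
    LC_ALL=C grep -P '[^\x00-\x7F]'
}
```

For example, `printf 'plain\ncafé\n' | list_non_ascii_lines` prints only the café line.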

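    You can also use the POSIX-style class [:ascii:]. It is not one of the classes that POSIX itself defines, but PCRE and Perl support it as an extension, so with grep it generally needs the -P option: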

    • [[:ascii:]] - matches a single ASCII char
    • [^[:ascii:]] - matches a single non-ASCII char

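    [^[:print:]] will probably suffice for you if you run it in the C locale (LC_ALL=C), where it matches every byte outside printable ASCII, including control characters. In a UTF-8 locale, printable non-ASCII characters such as é count as printable and are not matched.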
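For the original question (how many filenames are affected), the pieces can be combined roughly as follows. This is a sketch under stated assumptions, not a definitive solution: it assumes GNU find (for -printf) and GNU grep with -P support, and count_non_ascii_names is a hypothetical name. Filenames that contain newlines would be miscounted.

```shell
# count_non_ascii_names DIR (hypothetical helper): prints the number of
# entries under DIR whose own name contains a non-ASCII byte.
# Assumes GNU find (-printf) and GNU grep (-P).
count_non_ascii_names() {
    # %f prints only the last path component, so a directory with a
    # non-ASCII name is counted once, not once per file beneath it.
    # grep -c prints the number of matching lines, replacing wc -l.
    find "$1" -mindepth 1 -printf '%f\n' | LC_ALL=C grep -c -P '[^\x00-\x7F]'
}
```

Called as `count_non_ascii_names /path/to/dir`, it prints a single number. Note that grep exits with status 1 when the count is 0, which matters if the function is used under `set -e`.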