Grep with a regex character range that includes the NULL character


When I include the NULL character (\x00) in a regex character range in BSD grep, the result is unexpected: no characters match. Why is this happening?

Here is an example:

$ echo 'ABCabc<>/ă' | grep -o [$'\x00'-$'\x7f']

Here I expect every character except the last one to match; however, there is no output (no matches).

By contrast, when I start the character range from \x01, it works as expected:

$ echo 'ABCabc<>/ă' | grep -o [$'\x01'-$'\x7f']
A
B
C
a
b
c
<
>
/

Also, here are my grep and BASH versions:

$ grep --version
grep (BSD grep) 2.5.1-FreeBSD

$ echo $BASH_VERSION
3.2.57(1)-release

Solution

  • Note that $'...' is a shell quoting construct, so this command

    $ echo 'ABCabc<>/ă' | grep -o [$'\x00'-$'\x7f']
    

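    would try to pass a literal NUL character as part of a command-line argument to grep. That's impossible on any Unix-like system, because command-line arguments are passed to the process as NUL-terminated strings, so the NUL can never reach grep. In bash specifically, $'\x00' appears to expand to an empty string, so the pattern grep receives is effectively [-\x7f] (a hyphen or DEL), which matches nothing in your input. A shell that keeps the NUL internally would instead have the argument cut off at the NUL, so grep would see just the arguments -o and [.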
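
To check what grep would actually be handed, the argument can be dumped as hex instead of being passed to grep. This is only a sketch and assumes bash; `show_bytes` is a hypothetical helper name, and `set -f` is there so the unquoted brackets are not treated as a glob.

```shell
#!/usr/bin/env bash
# Sketch: dump the bytes of the argument bash builds from the question's pattern.
set -f   # turn off globbing so the unquoted [...] word is not matched against file names

# show_bytes is a hypothetical helper: prints its first argument as hex bytes.
show_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

with_nul=$(show_bytes [$'\x00'-$'\x7f'])   # range written as starting at NUL
with_soh=$(show_bytes [$'\x01'-$'\x7f'])   # range starting at \x01

printf 'from \\x00: %s\n' "$with_nul"
printf 'from \\x01: %s\n' "$with_soh"
set +f   # restore globbing
```

In bash I would expect the first line to show no 00 byte at all, while the second shows the 01 byte between the bracket and the hyphen.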

    You would need a pattern that matches the NUL byte without containing it literally. As far as I know, grep itself doesn't support the \000 or \x00 escapes. Perl does, though, so this prints the input line that contains the NUL:

    $ printf 'foo\nbar\0\n' | perl -ne 'print if /\000/'
    bar
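
If Perl is not an option, a different tool that applies the same idea is tr: it interprets octal escapes such as \000 itself, so the shell never has to pass a real NUL. The sketch below is not a grep replacement. It only filters or counts bytes, and the variable names are illustrative.

```shell
#!/usr/bin/env bash
# tr receives the four characters \000 and decodes them itself,
# so no literal NUL ever appears in the argument list.

# Keep only bytes in the range 0x00-0x7f; \304\203 is the UTF-8 encoding of ă.
ascii_only=$(printf 'ABCabc<>/\304\203\n' | LC_ALL=C tr -cd '\000-\177')
printf 'ASCII bytes kept: %s\n' "$ascii_only"

# Count the NUL bytes in a stream by deleting everything else.
nul_count=$(printf 'foo\nbar\0\n' | LC_ALL=C tr -cd '\000' | wc -c | tr -d ' ')
printf 'NUL bytes found: %s\n' "$nul_count"
```

Setting LC_ALL=C makes tr work on single bytes, which is what a range like \000-\177 needs.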
    

    As an aside, at least GNU grep doesn't seem to like that kind of range expression, so if you were to use it, you'd have to do something different. In the C locale, [[:cntrl:][:print:]] might work to match the characters from \x01 to \x7f, but I didn't check comprehensively. The grep manual describes the character classes.
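
A minimal sketch of the character-class variant, run on the input from the question. It assumes a grep that supports -a and -o (both GNU and BSD grep do). The -a flag is a precaution so the non-ASCII bytes do not make grep treat the input as binary.

```shell
#!/usr/bin/env bash
# Sample input from the question; \304\203 is the UTF-8 encoding of ă.
sample() { printf 'ABCabc<>/\304\203\n'; }

# LC_ALL=C makes grep work byte by byte; -o prints each match on its own line.
matches=$(sample | LC_ALL=C grep -a -o '[[:cntrl:][:print:]]')
printf '%s\n' "$matches"
```

The pattern is single-quoted, so the shell passes the brackets to grep unchanged.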


    Note also that [$'\x00'-$'\x7f'] has an unquoted pair of [ and ] and so is a shell glob. This isn't related to the NUL byte, but if you have files that match the glob (any one-letter names, if the glob works on your system; it doesn't on my Linux), or have failglob or nullglob set, it will probably give results you don't want. Instead, quote the brackets too, for example $'[\x01-\x7f]' for the range that works. Quoting the brackets doesn't help with the NUL itself: $'[\x00-\x7f]' still can't carry a NUL to grep.
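
The glob problem can be reproduced in a scratch directory. This sketch uses the simpler range [a-z] as a stand-in for the pattern above, and the file names a and b are made up for the demonstration.

```shell
#!/usr/bin/env bash
tmpdir=$(mktemp -d)
touch "$tmpdir/a" "$tmpdir/b"   # two one-letter file names the glob can match

# Unquoted: the shell replaces the bracket expression with matching file names.
unquoted=$(cd "$tmpdir" && echo [a-z])

# Quoted: the bracket expression is passed through to the command unchanged.
quoted=$(cd "$tmpdir" && echo '[a-z]')

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:   %s\n' "$quoted"
rm -rf "$tmpdir"
```

With grep in place of echo, the unquoted form would search for the pattern a in a file named b, which is rarely what was intended.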