Tags: shell, grep, gnu-coreutils

Issues matching zero-byte using grep


I'm trying to find 7zip version 3 file headers in a file. According to the documentation they should look like this:

00: 6 bytes: 37 7A BC AF 27 1C        - Signature 
06: 2 bytes: 00 04                    - Format version

So I constructed this grep command which should match them:

grep --only-matching --byte-offset --binary --text $'7z\xBC\xAF\x27\x1C\x00\x03'

Yet it also matches the string ending in 0000:

% xxd -p -r <<< "aaaa 377a bcaf 271c 0000 bbbb 00 377a bcaf 271c 0003" | grep --only-matching --byte-offset --binary --text $'7z\xBC\xAF\x27\x1C\x00\x03'
2:7z'
13:7z'

The output I expect to have is just 13:7z'


Solution

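  • It's not possible to pass a zero byte as part of a command-line argument. Strings in C are terminated by a zero byte, so when grep runs strlen(argv[...]) it does not "see" anything after the zero byte. The pattern is effectively just 7z\xBC\xAF\x27\x1C, which is why both headers match.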
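
    To see the truncation in practice, here is a minimal sketch that searches for only the six bytes grep is left with once the argument is cut at the zero byte. The `sample` helper and the `seen` and `offsets` variables are names made up for this illustration, the sample data is written as printf octal escapes instead of going through xxd, and GNU grep is assumed.

    ```shell
    # Hypothetical helper: the sample bytes from the question as octal escapes
    # (aaaa 377abcaf271c 0000 bbbb 00 377abcaf271c 0003).
    sample() {
        printf '\252\252\067\172\274\257\047\034\000\000'
        printf '\273\273\000\067\172\274\257\047\034\000\003'
    }

    # What grep effectively receives: the argument stops before the first \x00.
    seen=$(printf '7z\274\257\047\034')

    # Assumes GNU grep; cut keeps only the byte offset of each match.
    offsets=$(sample |
        LC_ALL=C grep --only-matching --byte-offset --binary --text -e "$seen" |
        LC_ALL=C cut -d: -f1)
    echo "$offsets"
    ```

    With GNU grep this should report the offsets 2 and 13, the same two matches as the command in the question.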

    If the regex contains no newlines, you can write it to a file and pass it with --file=FILE (-f), since grep reads one pattern per line from that file:

    xxd -p -r <<< "aaaa 377a bcaf 271c 0000 bbbb 00 377a bcaf 271c 0003" |
    LC_ALL=C grep --only-matching --byte-offset --binary --text -f <(
        echo -n 7z;
        echo BCAF271C0003 | xxd -r -p
    )
    

    See https://www.gnu.org/software/grep/manual/grep.html#Matching-Non_002dASCII

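    Alternatively, use a Perl-compatible regex (-P). Note the plain single quotes instead of $'...': the \x00 escape reaches grep as text and is interpreted by the regex engine, so no zero byte is passed in the argument: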

    xxd -p -r <<< "aaaa 377a bcaf 271c 0000 bbbb 00 377a bcaf 271c 0003" | 
    LC_ALL=C grep --only-matching --byte-offset --binary --text -P '7z\xBC\xAF\x27\x1C\x00\x03'
    

    When dealing with binary data, remember to disable UTF-8 sequence handling by setting the locale with LC_ALL=C.

    Note: <<< (here-strings), $'string' and <(...) are not available in every shell. They are supported by bash, but they are not part of POSIX sh.
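
    For shells without those features, the --file approach can be sketched with printf and a temporary file. This is only an illustration under a few assumptions: `patfile` and `matches` are made-up names, the bytes are written as octal escapes, mktemp is assumed to be installed, and GNU grep is still required.

    ```shell
    # Temporary pattern file (hypothetical name); removed when the script exits.
    patfile=$(mktemp)
    trap 'rm -f "$patfile"' EXIT

    # Full 8-byte pattern including the zero byte, with no trailing newline.
    printf '7z\274\257\047\034\000\003' > "$patfile"

    # Sample data from the question, generated without xxd or a here-string.
    # cut keeps only the byte offset of each match.
    matches=$(
        printf '\252\252\067\172\274\257\047\034\000\000\273\273\000\067\172\274\257\047\034\000\003' |
        LC_ALL=C grep --only-matching --byte-offset --binary --text --file="$patfile" |
        LC_ALL=C cut -d: -f1
    )
    echo "$matches"
    ```

    Because the zero byte travels through a file rather than through argv, the whole pattern reaches grep, and only the header at offset 13 should be reported.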