regex awk language-lawyer posix

Treatment of backslash character in the bracket expression


Section 3.4, Using Bracket Expressions, of the GNU awk manual reads:

To include one of the characters ‘\’, ‘]’, ‘-’, or ‘^’ in a bracket expression, put a ‘\’ in front of it. For example:
     [d\]]
matches either ‘d’ or ‘]’. Additionally, if you place ‘]’ right after the opening ‘[’, the closing bracket is treated as one of the characters to be matched.

The treatment of ‘\’ in bracket expressions is compatible with other awk implementations and is also mandated by POSIX.

On the other hand, the Regular Expressions section of the POSIX awk specification doesn't list \] as having a special meaning. Here are a few experiments with GNU awk (version 5.3.1) and GNU grep (version 3.11) that expose conflicting treatment of \ in a bracket expression:

$ echo d | awk '/[d\]]/'
d
$ echo d | grep -E '[d\]]'
$ echo ']' | awk '/[d\]]/'
]
$ echo ']' | grep -E '[d\]]'
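
One possible reading of the empty grep output, assuming grep -E applies the POSIX ERE rule that a backslash is an ordinary character inside a bracket expression: `[d\]` is then a complete bracket expression ("d or \") and the trailing `]` is a literal character. Under that assumption the pattern should match the two-character strings `d]` and `\]` rather than `d` or `]` alone. The variable name `ere_matches` is only illustrative:

```shell
# Under the POSIX ERE reading, '[d\]]' means "one of d or \" followed by a literal ']'.
# printf '%s\n' passes the backslash through unchanged.
ere_matches=$(printf '%s\n' 'd' ']' 'd]' '\]' | grep -E '[d\]]')
printf '%s\n' "$ere_matches"
```

With a grep that follows that rule, only the lines `d]` and `\]` should be printed.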

The question is:
is the GNU awk documentation wrong in claiming that GNU awk's treatment of \ in a bracket expression is mandated by POSIX, or have I overlooked something?
In other words, does GNU awk violate the POSIX specification?


Solution

  • The POSIX reference that allows awk to interpret \ in a bracket expression as an escape character is the table under Regular Expressions in the POSIX awk spec (emphasis mine; note in particular the last 2 rows of the table):

    Regular Expressions

    ... these escape sequences shall be recognized both inside and outside bracket expressions ...

    Escape Sequence: \"
    Description: <backslash> <quotation-mark>
    Meaning: In the lexical token STRING, <quotation-mark> character. Otherwise undefined.

    Escape Sequence: \/
    Description: <backslash> <slash>
    Meaning: In the lexical token ERE, <slash> character. Otherwise undefined.

    Escape Sequence: \ddd
    Description: A <backslash> character followed by the longest sequence of one, two, or three octal-digit characters (01234567). ...
    Meaning: The character whose encoding is represented by the one, two, or three-digit octal integer...

    Escape Sequence: \., \[, \(, \*, \+, \?, \{, \|, \^, \$
    Description: A <backslash> character followed by a character that has a special meaning in EREs ... other than <backslash>.
    Meaning: In the lexical token ERE when not inside a bracket expression, the sequence shall represent itself. Otherwise undefined.

    Escape Sequence: \\
    Description: Two <backslash> characters.
    Meaning: In the lexical token ERE, the sequence shall represent itself...

    Escape Sequence: \c
    Description: A <backslash> character followed by any character not described in this table or in the table in XBD 5. File Format Notation ('\\', '\a', '\b', '\f', '\n', '\r', '\t', '\v').
    Meaning: Undefined

    That means that in a POSIX-compliant awk, inside or outside of a bracket expression, \\ is mandated to mean a literal \. The meaning of \c, where c is any character not listed in the table (e.g. ]), is undefined by POSIX, so gawk can treat it however it likes. That leaves it free to make [d\]] mean "d or ]", for example.

    So no, gawk is not violating the POSIX awk spec (which supersedes the POSIX regexp spec for describing awk behavior) in its treatment of \. It treats \ as required for [\\] and, because the meaning is undefined, as allowed for [d\]].
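
If the goal is a pattern that does not rely on the undefined `\]` case, one option is the other rule quoted from the gawk manual: put `]` right after the opening `[`. The sketch below assumes an awk and a grep -E that both honor that rule, and also exercises `[\\]`, the one backslash sequence whose meaning inside an awk ERE token is defined by the table above. The variable names `portable` and `backslash` are only illustrative:

```shell
# ']' placed first in the bracket expression is a literal ']',
# so '[]d]' should mean "d or ]" in both tools without any backslash.
portable=$(
    for s in 'd' ']'; do
        printf '%s\n' "$s" | awk '/[]d]/'
        printf '%s\n' "$s" | grep -E '[]d]'
    done
)

# '\\' inside the awk ERE token should match one literal backslash.
backslash=$(printf '%s\n' 'a\b' | awk '/[\\]/')

printf '%s\n' "$portable" "$backslash"
```

Each of `d` and `]` should be echoed once by awk and once by grep, followed by the line `a\b` from the last command.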