regex sed posix

sed: why does a POSIX character class need to be inside another pair of brackets?


Question

Why does a POSIX expression such as [:space:] need to be placed inside another [ ]?

$ echo "a b c" | sed 's/[:space:]*/_/g'
_ _b_ _

$ echo "a b c" | sed 's/[[:space:]]*/_/g'
_a_b_c_

$ echo "a b c" | sed 's/[[:space:]][[:space:]]*/_/g'
a_b_c

Update

Regular Expressions/POSIX Basic Regular Expressions

Character classes
The POSIX standard defines some classes or categories of characters as shown below. These classes are used within brackets.

I had not understood what character classes were and assumed [:space:] was a special sequence matching any white space, so I believed 's/[:space:]/_/g' would match the space between "a" and "b". I now suppose that '[:space:]' by itself does not match any character (please correct me if this is still wrong).

I suppose [:space:] is like ' \t\n\r\f\v' but has no function by itself. Inside brackets, as '[[:space:]]', it has the same function as '[ \t\n\r\f\v]'.


Solution

  • You need to understand the terminology:

    A bracket expression is a set of characters enclosed in [ and ] and can be used as such in a regexp. That set of characters can be represented by any combination of any of the following (and an optional initial ^ negation character):

    1. A character list, e.g. abcd...z, or
    2. A character range, e.g. a-z, or
    3. A character class, e.g. [:lower:]
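The three building blocks listed above, plus the optional ^ negation, can each be tried on their own. This is a minimal sketch that assumes a POSIX sed and forces the C locale; the sample input and the variable names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: force the C locale so ranges and classes behave predictably.
LC_ALL=C
export LC_ALL

input='abc XYZ 123'   # hypothetical sample input

# 1. Character list: matches exactly the characters a, b, and c.
list_result=$(printf '%s\n' "$input" | sed 's/[abc]/-/g')

# 2. Character range: matches X, Y, and Z in the C locale.
range_result=$(printf '%s\n' "$input" | sed 's/[X-Z]/-/g')

# 3. Character class inside a bracket expression: matches any digit.
class_result=$(printf '%s\n' "$input" | sed 's/[[:digit:]]/-/g')

# Leading ^ negates the set: matches everything that is not a digit.
negated_result=$(printf '%s\n' "$input" | sed 's/[^[:digit:]]/-/g')

printf '%s\n' "$list_result" "$range_result" "$class_result" "$negated_result"
```

In every case the outer [ and ] delimit the bracket expression; the class name, with its own [: and :], is just one of the items inside it.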

    So [:space:] is a character class (representing all white space chars), and it can be used within a bracket expression [...] in a regexp just as if you had listed all white space chars individually within that bracket expression. So this:

    [:space:]
    

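    is just a character class, which has its special meaning only inside a bracket expression (written on its own in a regexp, it is read as a bracket expression containing the characters :, s, p, a, c, and e, which explains the first output in the question), while this: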

    [[:space:]]
    

    is a bracket expression which includes all white space chars and this:

    [[:space:][:lower:]_#;A-D]
    

    is a bracket expression which includes all white space chars plus all lower case letters plus the chars _, #, and ; plus the letters in the range A through D (whatever those chars are in your locale).
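The difference between these forms can be checked with a short script. This is only a sketch: it assumes a POSIX sed and the C locale, and the variable names and sample inputs are invented for illustration. Because some sed versions reject a bare [:space:] with an error, the second case writes out the equivalent character list [aceps:] instead.

```shell
#!/bin/sh
# Sketch only: force the C locale so the class and the A-D range are predictable.
LC_ALL=C
export LC_ALL

# 1. Class inside a bracket expression: every white space char becomes "_".
class_in_brackets=$(printf 'a b\tc\n' | sed 's/[[:space:]]/_/g')

# 2. How a bare [:space:] is read: a bracket expression listing the
#    characters ":", "s", "p", "a", "c", "e". Written here as [aceps:],
#    which is the same set, so the space is left alone.
bare_equivalent=$(printf 'a b c\n' | sed 's/[aceps:]/_/g')

# 3. Classes, a list, and a range combined in one bracket expression.
combined=$(printf 'a B_E #x;Z\n' | sed 's/[[:space:][:lower:]_#;A-D]/./g')

printf '%s\n' "$class_in_brackets" "$bare_equivalent" "$combined"
```

In the third case only E and Z survive, since they are upper case letters outside the range A through D.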