
awk - Character Class Regex gsub Changes output


Parsing the following with awk:

$> df -h  /
Filesystem      Size  Used Avail Use% Mounted on
rootfs          476G  370G  106G  78% /

If I use an explicit match for the G's on the values, it works as expected:

$> awk -v indrive="/dev/sda1" 'NR!=1{gsub(/G/,""); print $2,$4,indrive}' <(df -h /)
476 106 /dev/sda1

However, if I generalize it with a character class:

$> awk -v indrive="/dev/sda1" 'NR!=1{gsub(/[[:alpha:]]/,""); print $2,$4,indrive}' <(df -h /)
370 78% /dev/sda1

Not sure where 370 and 78% are coming from.

Update: I actually get the same from:

$> awk -v indrive="/dev/sda1" 'NR!=1{gsub(/[a-zA-Z]/,""); print $2,$4,indrive}' <(df -h /)
370 78% /dev/sda1

But with [[:upper:]] it seems to work fine:

$> awk -v indrive="/dev/sda1" 'NR!=1{gsub("([[:upper:]])*",""); print $2,$4,indrive}' <(df -h /)
476 106 /dev/sda1

Solution

  • This seems to be what you're trying to do, using any awk (with cat file standing in for your df command for the demo):

    $ cat file |
    awk -v indrive='/dev/sda1' 'NR>1{$0=$2 FS $4; gsub(/[[:alpha:]]/,""); print $0, indrive}'
    476 106 /dev/sda1
    

    or this with GNU awk for gensub():

    $ cat file |
    awk -v indrive='/dev/sda1' 'NR>1{print gensub(/[[:alpha:]]/,"","g",$2 FS $4), indrive}'
    476 106 /dev/sda1
    

    Your code applied gsub() to the whole line ($0). With [[:alpha:]] that also removed rootfs, the original $1, and because modifying $0 makes awk re-split it, the remaining values ended up in different field numbers. The commands above select the input fields first and then run gsub() on just those.
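
    The re-splitting can be seen directly with a small sketch. It assumes an awk that supports POSIX character classes; the sample line is copied from the df output in the question, and the variable names line and out are only for illustration:

    ```shell
    # Sample data line copied from the df output in the question.
    line='rootfs          476G  370G  106G  78% /'

    # Print $2 and $4 before and after gsub() rewrites $0.
    out=$(printf '%s\n' "$line" | awk '{
        print "before:", $2, $4        # fields as originally split
        gsub(/[[:alpha:]]/, "")        # no target given, so it edits $0
        print "after:", $2, $4         # $0 was re-split without rootfs
    }')
    printf '%s\n' "$out"
    # before: 476G 106G
    # after: 370 78%
    ```

    After the gsub() the line starts with the blanks where rootfs used to be, so 476 becomes $1 and every later value moves down by one field.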

    Regarding:

    Not sure where 370 and 78% are coming from.

    They're the 3rd and 5th fields of your original input with the G removed. Once rootfs is deleted and $0 is re-split, they become $2 and $4:

    Filesystem      Size  Used Avail Use% Mounted on
    rootfs          476G  370G  106G  78% /
                          ^^^         ^^^
    

    Regarding:

    gsub(/[[:alpha:]]/,"")...

    ...Update: I actually get the same from...

    gsub(/[a-zA-Z]/,"")...

    For the alphabetic characters present in your input, the ranges a-z and A-Z together match exactly what [[:alpha:]] matches, so both produce the same result. In some locales the two are identical sets of characters.

    Regarding your comment:

    I thought it was just going to apply gsub to $0 and leave the fields intact if possible. Still not sure why [:upper:] and G alone work as expected but I guess that's another question.

    [[:upper:]] (the set of all upper case letters) and G worked because they match only the Gs you want to remove from your input. [[:alpha:]] also matches each of the 6 lower case letters in rootfs, so it removed that whole first field as well.
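
    A quick side-by-side check of the two classes, as a sketch under the same assumptions (an awk with POSIX character class support, the sample line from the question, and illustrative variable names line, upper and alpha):

    ```shell
    # Sample data line copied from the df output in the question.
    line='rootfs          476G  370G  106G  78% /'

    # [[:upper:]] strips only the Gs, so rootfs stays in place as $1.
    upper=$(printf '%s\n' "$line" | awk '{gsub(/[[:upper:]]/, ""); print $1, $2, $4}')

    # [[:alpha:]] strips rootfs too, so every value moves down one field.
    alpha=$(printf '%s\n' "$line" | awk '{gsub(/[[:alpha:]]/, ""); print $1, $2, $4}')

    printf '%s\n' "$upper"   # rootfs 476 106
    printf '%s\n' "$alpha"   # 476 370 78%
    ```

    Printing $1 alongside $2 and $4 makes the shift visible: with [[:upper:]] the first field is still rootfs, while with [[:alpha:]] it is already 476.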