awk - Character Class Regex in gsub Changes Output

Parsing the following with awk:

$> df -h  /
Filesystem      Size  Used Avail Use% Mounted on
rootfs          476G  370G  106G  78% /

If I use an explicit match for the G's on the values, it works as expected:

$> awk -v indrive="/dev/sda1" 'NR!=1{gsub(/G/,""); print $2,$4,indrive}' <(df -h /)
476 106 /dev/sda1

However, if I generalize it with a character class:

awk -v indrive="/dev/sda1" 'NR!=1{gsub(/[[:alpha:]]/,""); print $2,$4,indrive}' <(df -h /)
370 78% /dev/sda1

Not sure where 370 and 78% are coming from.

Update: I actually get the same from:

awk -v indrive="/dev/sda1" 'NR!=1{gsub(/[a-zA-Z]/,""); print $2,$4,indrive}' <(df -h /)
370 78% /dev/sda1

But with [[:upper:]] it seems to work fine:

awk -v indrive="/dev/sda1" 'NR!=1{gsub("([[:upper:]])*",""); print $2,$4,indrive}' <(df -h /)
476 106 /dev/sda1

Solution

It seems like this is what you're trying to do. It works in any awk (cat file stands in for your df command for demonstration purposes):

$ cat file |
awk -v indrive='/dev/sda1' 'NR>1{$0=$2 FS $4; gsub(/[[:alpha:]]/,""); print $0, indrive}'
476 106 /dev/sda1

or this with GNU awk for gensub():

$ cat file |
awk -v indrive='/dev/sda1' 'NR>1{print gensub(/[[:alpha:]]/,"","g",$2 FS $4), indrive}'
476 106 /dev/sda1

Your code applied gsub() to the whole line. That removed $1 (rootfs), and because $0 was modified, awk re-split it into different fields. The scripts above instead select the input fields first and then run gsub() on just those.
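Another option, not taken from the scripts above, is to pass the target field as the third argument to gsub() so that only that field is edited. This is a minimal sketch that assumes the sample df output from the question; the variable name per_field_out is just for illustration:

```shell
# Feed the question's sample df output to awk (printf stands in for df -h /)
per_field_out=$(printf '%s\n' \
  'Filesystem      Size  Used Avail Use% Mounted on' \
  'rootfs          476G  370G  106G  78% /' |
awk -v indrive='/dev/sda1' 'NR>1 {
    gsub(/[[:alpha:]]/, "", $2)   # strip letters from Size only
    gsub(/[[:alpha:]]/, "", $4)   # strip letters from Avail only
    print $2, $4, indrive
}')
printf '%s\n' "$per_field_out"
```

Assigning to a field rebuilds $0 from the existing fields but does not re-split it, so $2 and $4 still refer to the original columns.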

Regarding:

Not sure where 370 and 78% are coming from.

They're the 3rd and 5th fields of your input with the G removed. Once rootfs is deleted from the front of the line, they become the 2nd and 4th fields:

Filesystem      Size  Used Avail Use% Mounted on
rootfs          476G  370G  106G  78% /
                      ^^^         ^^^
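To see the shift directly, here is a small sketch that prints NF and $2 before and after the gsub() on the data line from the question (the variable names line and resplit_out are just for illustration):

```shell
# Data line copied from the question's df output
line='rootfs          476G  370G  106G  78% /'

resplit_out=$(printf '%s\n' "$line" |
awk '{
    print "before:", NF, $2
    gsub(/[[:alpha:]]/, "")   # edits $0, so awk splits the fields again
    print "after:", NF, $2
}')
printf '%s\n' "$resplit_out"
```

With rootfs gone the record has one field fewer, so what used to be $3 is now $2.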

Regarding:

gsub(/[[:alpha:]]/,"")...

...Update: I actually get the same from...

gsub(/[a-zA-Z]/,"")...

For the alphabetic characters present in your input, the ranges a-z and A-Z together match exactly what [[:alpha:]] matches. In some locales the two are identical sets of characters.

Regarding your comment:

I thought it was just going to apply gsub to $0 and leave the fields intact if possible. Still not sure why [:upper:] and G alone work as expected but I guess that's another question.

[[:upper:]] (the set of all upper-case letters) and G worked because they match only the Gs you want to remove from your input. [[:alpha:]] also matches each of the 6 lower-case letters in rootfs, so it removed that whole first field as well.
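As a rough side-by-side check, this sketch runs the three regexes from the question over the same data line and prints the first two fields that result. The regex is passed in as a string through -v, and the names line, re and compare_out are just for illustration:

```shell
# Data line copied from the question's df output
line='rootfs          476G  370G  106G  78% /'

compare_out=$(for re in 'G' '[[:upper:]]' '[[:alpha:]]'; do
    printf '%s\n' "$line" |
    awk -v re="$re" '{
        gsub(re, "")          # re is used as a dynamic regex on $0
        print re ": $1=" $1 ", $2=" $2
    }'
done)
printf '%s\n' "$compare_out"
```

Only the [[:alpha:]] run should lose rootfs and shift the remaining fields left.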