
How to count occurrences regardless of case?


table

chr10   10482   10484   0   11  +   CA
chr10   10486   10488   0   12  +   ca
chr10   10487   10489   0   13  +   Ca
chr10   10490   10492   0   13  +   cA
chr10   10491   10493   0   12  +   CT
chr10   10494   10496   6.66667 15  +   ca
chr10   10495   10497   6.66667 15  +   cc

I would like to count the number of lines in which column 7 contains "CA", regardless of whether either of the two letters is in upper or lower case.

The desired output would be 5.

The two commands below give empty output:

cat table | awk ' $7 ==/^[Cc][Aa]/{++count} END {print count}'

awk 'BEGIN {IGNORECASE = 1} $7==/"CA"/ {++count} END {print count}' table

The command below returns a value of 1:

awk 'BEGIN {IGNORECASE = 1} END {if ($7=="CA"){++count} {print count}}' table

Note: my actual table is tens of millions of lines long, so I do not want to write an intermediate table just to count. (I need to repeat this task for other files too.)


Solution

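  • There is a small problem in your syntax: you either say var == "string" or var ~ /regexp/, but you are saying var == /regexp/. A bare /regexp/ is shorthand for $0 ~ /regexp/ and evaluates to 1 or 0, so $7 == /regexp/ compares column 7 with a number and never matches here. Your third command fails for a different reason: the test sits in the END block, which runs once after all input has been read instead of once per line. Using the correct combination makes your command work: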

    $ awk '$7 ~ /^[Cc][Aa]/{++count} END {print count+0}' file
    5
    $ awk 'BEGIN {IGNORECASE = 1} $7=="CA" {++count} END {print count+0}' file
    5
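To see why the original `$7 == /regexp/` form prints nothing, it may help to look at what a bare regex evaluates to. The sketch below uses a single made-up input line (an assumption, not the real table) and the variable names `line`, `bare`, `wrong` and `right` are only for illustration:

```shell
# Hypothetical one-line input: column 7 is "ca".
line='chr10 10486 10488 0 12 + ca'

# A bare /regex/ is tested against the whole record ($0) and yields 1 or 0.
# The record starts with "ch", so the result here is 0.
bare=$(printf '%s\n' "$line" | awk '{ hit = /^[Cc][Aa]/; print hit }')

# `$7 == /regex/` therefore compares "ca" with that 0 and does not match.
wrong=$(printf '%s\n' "$line" | awk '$7 == /^[Cc][Aa]/ {++count} END {print count+0}')

# The match operator `~` tests the regex against $7 itself.
right=$(printf '%s\n' "$line" | awk '$7 ~ /^[Cc][Aa]/ {++count} END {print count+0}')

echo "bare=$bare wrong=$wrong right=$right"
```

Only the `~` version counts the line; the `==` version silently counts nothing, which matches the empty output seen in the question.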
    

    Alternatively, you can normalize the case with toupper() (or tolower()) instead of relying on IGNORECASE, which is a gawk extension and is not available in every awk:

    awk 'toupper($7) == "CA" {++count} END {print count+0}' file
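For a quick reproducible check, the toupper() approach can be run against a stand-in for the question's table. In this sketch only column 7 is taken from the question; the other columns are repeated placeholder values, and the temporary file and the `table` and `count` shell variables are assumptions for illustration:

```shell
# Build a seven-row, tab-separated stand-in for the table.
# printf reuses the format once per argument, so each value becomes column 7 of one row.
table=$(mktemp)
printf 'chr10\t10482\t10484\t0\t11\t+\t%s\n' CA ca Ca cA CT ca cc > "$table"

# Upper-case column 7 before comparing, so CA, ca, Ca and cA all match.
count=$(awk 'toupper($7) == "CA" {++count} END {print count+0}' "$table")
echo "$count"

rm -f "$table"
```

awk reads the file line by line, so no intermediate table is written, which should suit inputs of tens of millions of lines.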
    

    Note the trick of printing count+0 instead of just count. Adding 0 forces a numeric context, so an unset count prints as 0 when there are no matches; printing count alone would give an empty string.
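The difference is easy to see on input with no matches. A minimal sketch, assuming a single made-up line and a search for "GG" that cannot match it (`line`, `plain` and `forced` are illustrative names):

```shell
# No line has "GG" in column 7, so `count` is never assigned.
line='chr10 10491 10493 0 12 + CT'

# Printing the unset variable gives an empty string.
plain=$(printf '%s\n' "$line" | awk 'toupper($7) == "GG" {++count} END {print count}')

# Adding 0 forces a number, so the result is 0.
forced=$(printf '%s\n' "$line" | awk 'toupper($7) == "GG" {++count} END {print count+0}')

echo "plain=[$plain] forced=[$forced]"
```

The brackets in the final echo only make the empty string visible.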