Search code examples
awkwildcard

Using AWK to filter csv data by column values, with wildcards


I have a csv file that contains lines such as

      1          Er6          Te100      7.02455       1   0   0           -0.00420 
      2          Cu7          S150       6.96222       0  -1   0           -0.00421 
      3          Sc8          P50        7.01955       0   0   0           -0.00397 

I would like to use awk to filter the data and display rows that

  1. have their second column having values of Cu* and the third column having values of S* or Te*

OR

  1. have their second column having values of S* or Te* and the third column having values of Cu*

So far, I am building the command up from its simplest form and is stuck at

awk '{ if (($2 == "Cu*")) { print } }' file

This does not output anything. It seems that "Cu*" is not recognized by AWK as a wildcard, hence there is no match.

I'm also unsure how to deliver the OR statement that contains two and statements.

Help is much appreciated.


Solution

  • I post this as an answer for the sake of formatting, but jhnc should get the "win" for this question, I think.

    An awk script, whether in a file or a one-liner is made up of sets of condition { statement } which are evaluated for every line of input. A blank condition is equivalent to true, and a blank statement is equivalent to { print }. Thus, if your intended action is just to print, your statement can be dropped, and if you're just evaluating patterns, your if can be dropped and the condition moved to its proper place.

    You've got a one-liner already. Know that it can be stored in a file as well:

    $ cat test.awk
    
    ($2~/^Cu/ && $3~/^(S|Te)/) ||
    ($2~/^(S|Te)/ && $3~/^Cu/)
    
    $ awk -f test.awk input.txt
          2          Cu7          S150       6.96222       0  -1   0           -0.00421
    

    As you can see, I've put your conditions on separate lines for easier reading in case the list grows. Taken altogether, it's still a single awk condition which, if it evaluates as true, takes the default action and prints the line. That said, take care when splitting lines. This script still works because || is at the end of the line, so awk knows the condition continues. Placing the || at the beginning of the next line would be an error.

    I don't know if your input dataset includes possible false positives with these patterns. For example /^s/ would match S150 but also Se10. If you want to make sure that the letters you're matching are always followed by non-letters, you could use conditions like this:

    ($2~/^Cu[^[:alpha:]]/ && $3~/^(S|Te)[^[:alpha:]]/) ||
    ($2~/^(S|Te)[^[:alpha:]]/ && $3~/^Cu[^[:alpha:]]/)
    

    This may seem cumbersome, and is not the only way to do this, but is perhaps worth the mention. If words in these columns will always have numbers after them, you could replace the character class match [^[:alpha:]] with [0-9].