I have a csv file that contains lines such as
1 Er6 Te100 7.02455 1 0 0 -0.00420
2 Cu7 S150 6.96222 0 -1 0 -0.00421
3 Sc8 P50 7.01955 0 0 0 -0.00397
I would like to use awk to filter the data and display rows that
OR
So far, I am building the command up from its simplest form and is stuck at
awk '{ if (($2 == "Cu*")) { print } }' file
This does not output anything. It seems that "Cu*" is not recognized by AWK as a wildcard, hence there is no match.
I'm also unsure how to deliver the OR statement that contains two and statements.
Help is much appreciated.
I post this as an answer for the sake of formatting, but jhnc should get the "win" for this question, I think.
An awk script, whether in a file or a one-liner is made up of sets of condition { statement }
which are evaluated for every line of input. A blank condition is equivalent to true
, and a blank statement is equivalent to { print }
. Thus, if your intended action is just to print, your statement can be dropped, and if you're just evaluating patterns, your if
can be dropped and the condition
moved to its proper place.
You've got a one-liner already. Know that it can be stored in a file as well:
$ cat test.awk
($2~/^Cu/ && $3~/^(S|Te)/) ||
($2~/^(S|Te)/ && $3~/^Cu/)
$ awk -f test.awk input.txt
2 Cu7 S150 6.96222 0 -1 0 -0.00421
As you can see, I've put your conditions on separate lines for easier reading in case the list grows. Taken altogether, it's still a single awk condition which, if it evaluates as true, takes the default action and prints the line. That said, take care when splitting lines. This script still works because ||
is at the end of the line, so awk knows the condition continues. Placing the ||
at the beginning of the next line would be an error.
I don't know if your input dataset includes possible false positives with these patterns. For example /^s/
would match S150
but also Se10
. If you want to make sure that the letters you're matching are always followed by non-letters, you could use conditions like this:
($2~/^Cu[^[:alpha:]]/ && $3~/^(S|Te)[^[:alpha:]]/) ||
($2~/^(S|Te)[^[:alpha:]]/ && $3~/^Cu[^[:alpha:]]/)
This may seem cumbersome, and is not the only way to do this, but is perhaps worth the mention. If words in these columns will always have numbers after them, you could replace the character class match [^[:alpha:]]
with [0-9]
.