A subset of my file looks like this:
row1 ./. 1/1 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 ./. 1/1 0/0 0/0
row2 ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./. 0/0 0/0 0/0 ./. 0/0 0/0 0/0 0/0 0/0 ./. 0/0 ./. ./.
row3 ./. 0/1 0/0 0/0 0/0 1/2 5/6 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./. 1/1 0/0 0/0
row4 ./. 1/1 1/1 0/0 0/0 0/0 0/0 1/6 0/0 ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./. 1/1 0/1 0/0
row5 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
Values are formatted n/n, where n is either a digit (0-9) or a period (.).
My goal: if columns 2, 3, 4, 20, and 21 are identical, return 1; otherwise return 0. Columns containing "./." should be ignored in the comparison: if the remaining columns are identical, return 1; otherwise return 0.
Example desired output:
row1 0
row2 1
row3 0
row4 0
row5 1
Row two receives "1" because although there are some instances of "./." in the columns I want to compare, all of the other values in the columns of interest are identical. Row five receives "1" because all of the values in the columns of interest are identical.
I have written this, which does partly what I want (it does not include all necessary combinations of fields):
awk 'BEGIN{OFS=" "}{sign="";{if ( ($2==$3 || $2=="./." || $3=="./.") && ($2==$4 || $4=="./.") && ($2==$20 || $20=="./.") && ($2==$21 || $21=="./.")) {sign="1 "}else{sign="0 "}}; print $2, $3, $4, $20, $21, sign}' test.txt
My full-size file has many more columns that would need to be included in the matching; is there a more concise way to write this?
$ awk 'BEGIN{a[2];a[3];a[4];a[5];a[21]} {for (i in a) if ($i!="./.") b[$i]; print $1,(length(b)==1); delete b}' test.txt
row1 0
row2 1
row3 0
row4 0
row5 1
Or, for people who prefer their code spread over multiple lines:
awk '
BEGIN {
    a[2]; a[3]; a[4]; a[5]; a[21]
}
{
    for (i in a)
        if ($i != "./.")
            b[$i]
    print $1, (length(b) == 1)
    delete b
}' test.txt
Array a determines which columns we check:
a[2];a[3];a[4];a[5];a[21]
Merely referencing an element such as a[2] creates that key, so this marks columns 2, 3, 4, 5, and 21 as the columns of interest.
We create a key in array b for each distinct value found in the columns listed in a, skipping any column whose value is ./.:
for (i in a) if ($i!="./.") b[$i]
We print out the results:
print $1,(length(b)==1)
If the length of b is 1, the columns of interest (excluding the ./. ones) all had the same value. In that case, the comparison length(b)==1 evaluates to 1 and we print the row header followed by 1. For any other length, the comparison evaluates to 0 and we print the row header followed by 0.
Lastly, we delete b in preparation for analyzing the next line:
delete b
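Some older awk implementations may not accept length() on an array or delete on a whole array. As a rough sketch of the same comparison without those features, the version below remembers the first non-missing value and stops at the first mismatch. The function name check_cols and its argument order are hypothetical, chosen only for illustration, and treating a row whose columns of interest are all ./. as 0 is an assumption that mirrors what length(b)==1 does above.

```shell
# Hypothetical helper (not from the answer above): check_cols FILE COL...
# Prints "<row label> 1" when the listed columns agree after ignoring "./.",
# otherwise "<row label> 0".
check_cols() {
    file=$1
    shift
    awk -v cols="$*" '
    BEGIN { n = split(cols, c, " ") }        # c[1..n] hold the column numbers
    {
        seen = 0; same = 1
        for (k = 1; k <= n; k++) {
            v = $(c[k])
            if (v == "./.") continue         # missing value: ignore it
            if (!seen) { first = v; seen = 1 }
            else if (v != first) { same = 0; break }
        }
        # Assumption: a row with only "./." in these columns prints 0,
        # the same result length(b)==1 would give.
        print $1, ((seen && same) ? 1 : 0)
    }' "$file"
}
```

It would be called as check_cols test.txt 2 3 4 5 21.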
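Alternatively, the columns of interest can be passed in as a variable. Here split stores the column numbers as the values of a, so the fields are referenced as $a[i]:
$ awk -v x='2 3 4 5 21' 'BEGIN{split(x,a)} {for (i in a) if ($a[i]!="./.") b[$a[i]]; print $1,(length(b)==1); delete b}' test.txt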
row1 0
row2 1
row3 0
row4 0
row5 1
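For a file with many more columns, typing every column number into x gets tedious. One possible extension, sketched here under the assumption that a spec such as '2-5 21' is an acceptable way to describe the columns, is to expand ranges into a plain list first and then hand that list to the same awk program. Both expand_cols and same_in_ranges are hypothetical names introduced for this example.

```shell
# Hypothetical helper: expand a spec such as "2-5 21" into "2 3 4 5 21".
expand_cols() {
    awk -v spec="$1" 'BEGIN {
        n = split(spec, part, " ")
        out = ""
        for (k = 1; k <= n; k++) {
            m = split(part[k], r, "-")       # "2-5" gives r[1]=2, r[2]=5
            lo = r[1] + 0
            hi = (m == 2) ? r[2] + 0 : lo    # a single number is its own range
            for (j = lo; j <= hi; j++)
                out = out (out == "" ? "" : " ") j
        }
        print out
    }'
}

# Hypothetical wrapper: same_in_ranges FILE 'SPEC'
# Runs the variable-based one-liner with the expanded column list.
same_in_ranges() {
    awk -v x="$(expand_cols "$2")" 'BEGIN{split(x,a)} {for (i in a) if ($a[i]!="./.") b[$a[i]]; print $1,(length(b)==1); delete b}' "$1"
}
```

With these definitions, same_in_ranges test.txt '2-5 21' passes the same column list as x='2 3 4 5 21'.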