I want to compare 3 files together to see how much of the information in the files are the same. The file format is something like this:
Chr11 447 . A C 74 . DP=22;AF1=1;CI95=1,1;DP4=0,0,9,8;MQ=15;FQ=-78 GT:PL:GQ 1/1:107,51,0:99
Chr10 449 . G C 35 . DP=26;AF1=0.5;CI95=0.5,0.5;DP4=5,0,7,8;MQ=20;FQ=11.3;PV4=0.055,0.0083,0.028,1 GT:PL:GQ 0/1:65,0,38:41
Chr12 517 . G A 222 . DP=122;AF1=1;CI95=1,1;DP4=0,0,77,40;MQ=23;FQ=-282 GT:PL:GQ 1/1:255,255,0:99
Chr10 761 . G A 41 . DP=93;AF1=0.5;CI95=0.5,0.5;DP4=11,34,6,35;MQ=19;FQ=44;PV4=0.29,1.8e-35,1,1 GT:PL:GQ 0/1:71,0,116:74
I'm only interested in the first two columns (if the first two columns are the same then I consider it as equal). This is the comand that I use for comparing two files :
awk 'FILENAME==ARGV[1] {pair[$1 " " $2]; next} ($1 " " $2 in pair)' file1 file2 | wc -l
I would like to use the awk command since my files are really big and awk handle them really good! but I couldn't figure out how to use it for 3 files!
If it's simply to print out the pairs (column1 + column2) that are common in all 3 files, and making use of the fact that a pair is unique within a file, you could do it this way:
awk '{print $1" "$2}' a b c | sort | uniq -c | awk '{if ($1==3){print $2" "$3}}'
This can be made with arbitrary numbers of files as long as you modify the param of the last command.
Here's what it does:
awk '{print $1" "$2}' a b c | sort
)uniq -c
)If you're doing this often, you can express it as a bash function (and drop it in your .bashrc
) which parametrises the file counts.
function common_pairs {
awk '{print $1" "$2}' $@ | sort | uniq -c | awk -v numf=$# '{if ($1==numf){print $2" "$3}}';
}
Call it with any number of files you want: common_pairs file1 file2 file3 fileN