Suppose I have files similar to the following.
file 1
1,144931087,144931087,T,C
16,89017167,89017167,C,G
17,7330235,7330235,G,T
17,10222478,10222478,C,T
file 2
1,144931087,144931087,T,C
16,89017167,89017167,C,G
17,10222478,10222478,C,T
file 3
17,10222478,10222478,C,T
I would like to find how many times each duplicated value is present across the files. So ideally, the output would look like:
Output
2 1,144931087,144931087,T,C
2 16,89017167,89017167,C,G
3 17,10222478,10222478,C,T
1 17,7330235,7330235,G,T
I used the following command to count the duplicate values.
sort Test1.csv Test2.csv Test3.csv | uniq --count
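For reference, this is a self-contained way to reproduce those counts: the sample files are recreated with `printf` (filenames taken from the question), then fed through the same pipeline. Note that `uniq --count` left-pads the counts with spaces.

```shell
# Recreate the sample files from the question
printf '%s\n' '1,144931087,144931087,T,C' '16,89017167,89017167,C,G' \
    '17,7330235,7330235,G,T' '17,10222478,10222478,C,T' > Test1.csv
printf '%s\n' '1,144931087,144931087,T,C' '16,89017167,89017167,C,G' \
    '17,10222478,10222478,C,T' > Test2.csv
printf '%s\n' '17,10222478,10222478,C,T' > Test3.csv

# Count how many files each line appears in
sort Test1.csv Test2.csv Test3.csv | uniq --count
```

The exact ordering of the sorted lines can depend on your locale; the counts themselves do not.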
Now I wish to add the file names to the counted output. My desired output should look like this:
Test1 Test2 2 1,144931087,144931087,T,C
Test1 Test2 2 16,89017167,89017167,C,G
Test1 Test2 Test3 3 17,10222478,10222478,C,T
Test1 1 17,7330235,7330235,G,T
Can anyone help me get the desired output, or suggest a better way to achieve it?
Using awk. Sorry about my clever file naming scheme:
$ awk '{
a[$0]++ # count hits
b[$0]=b[$0] FILENAME " " # store filenames
}
END {
for(i in a)
print b[i] a[i],i # output them
}' foo bar baz
foo bar 2 1,144931087,144931087,T,C
foo bar 2 16,89017167,89017167,C,G
foo bar baz 3 17,10222478,10222478,C,T
foo 1 17,7330235,7330235,G,T
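Run against the question's actual files (recreated here), the same program produces the asker's desired format, except that `FILENAME` includes the `.csv` extension. Keep in mind that `for (i in a)` iterates in an unspecified order, so the result lines may come out in any order.

```shell
# Recreate the question's files
printf '%s\n' '1,144931087,144931087,T,C' '16,89017167,89017167,C,G' \
    '17,7330235,7330235,G,T' '17,10222478,10222478,C,T' > Test1.csv
printf '%s\n' '1,144931087,144931087,T,C' '16,89017167,89017167,C,G' \
    '17,10222478,10222478,C,T' > Test2.csv
printf '%s\n' '17,10222478,10222478,C,T' > Test3.csv

awk '{
    a[$0]++                      # count occurrences of each full line
    b[$0] = b[$0] FILENAME " "   # append each contributing filename
}
END {
    for(i in a)
        print b[i] a[i], i       # filenames, count, then the line itself
}' Test1.csv Test2.csv Test3.csv
```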
UPDATED per comments:
$ awk 'BEGIN {
    FS=OFS=","
}
{
    key = $1 OFS $2 OFS $3 OFS $4   # compare only the first four fields
    a[key]++                        # count hits
    b[key] = b[key] FILENAME "|"    # store filenames
    c[key] = $0                     # keep the last record with
}                                   # specific key combination
END {
    for(i in a)
        print b[i] "," a[i], c[i]
}' foo bar baz
foo|bar|,2,16,89017167,89017167,C,G
foo|,1,17,7330235,7330235,G,T
foo|bar|,2,1,144931087,144931087,T,C
foo|bar|baz|,3,17,10222478,10222478,C,T
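To see what keying on the first four fields buys you, here is a minimal sketch with two hypothetical one-line files (`f1`, `f2`) whose records agree on fields 1–4 but differ in the fifth: they collapse into a single key and are counted together, with `c[key]` keeping the last record seen.

```shell
printf '17,100,100,C,T\n' > f1   # made-up records that share the
printf '17,100,100,C,G\n' > f2   # first four fields, differ in the fifth

awk 'BEGIN { FS = OFS = "," }
{
    key = $1 OFS $2 OFS $3 OFS $4
    a[key]++                     # both records hit the same key
    b[key] = b[key] FILENAME "|"
    c[key] = $0                  # last record seen wins
}
END {
    for(i in a)
        print b[i] "," a[i], c[i]
}' f1 f2
# → f1|f2|,2,17,100,100,C,G
```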