I would like to count how many times each distinct substring appears in the set of strings in the 2nd column of a tab-separated file. To do this, I split the column to separate the substrings and then try to count them. However, it does not work correctly.
The input looks like this:
rs12255619 A/C chr10 AA AA AC AA AA AA AA AA AA AC AA
rs7909677 A/G chr10 AA AA AA AA AA AA AA AA AA AA AA
The desired output is:
rs12255619 A/C chr10 AA AA AC AA AA AA AA AA AA AC AA AA=9;AC=2
rs7909677 A/G chr10 AA AA AA AA AA AA AA AA AA AA CC AA=10;CC=1
and so on....
awk 'BEGIN {FS=OFS="\t"} {gf=split($2,gfp," ")} {for (i=1;i<=gf;i++){
if (gfp[i]=="AA"){i++; printf $1FS$2FS"%s\n" i, gfp[i]}
else if (gfp[i]=="AC" || gfp[i] == "CA"){i++; printf $1FS$2FS"%s"gfp[i]"="i";\n"}
}}' input > output
I also tried another script, but I think it repeats each count as many times as the substring occurs in each row. Here I performed a second split inside the first one to distinguish between the substrings:
awk 'BEGIN {FS=OFS="\t"} {gf=split($2,gfp," ");} {for (i=1;i<=gf;i++){
print par "=" gfeach[i]";"
}' input > output
I'm sure there are easier ways to do this, but I cannot get it fully working. Is it possible to do this in a UNIX environment? Thanks in advance.
Your input doesn't match your output, so we're all just guessing, but this might be what you want:
$ cat tst.awk
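BEGIN { FS=OFS="\t" }
{
    delete cnt                      # reset the per-row counts
    split($2,tmp,/ /)               # break column 2 into substrings
    for (i in tmp) {
        str = tmp[i]
        cnt[str]++                  # tally each substring
    }
    printf "%s", $0                 # original row, no newline yet
    sep = OFS
    for (str in cnt) {
        printf "%s%s=%d", sep, str, cnt[str]
        sep = ";"
    }
    print ""                        # end the output line
}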
Depending on what your input really is, the above will output one of the following:
$ cat file
rs12255619 A/C chr10 AA AA AC AA AA AA AA AA AA AC AA
rs7909677 A/G chr10 AA AA AA AA AA AA AA AA AA AA AA
$ awk -f tst.awk file
rs12255619 A/C chr10 AA AA AC AA AA AA AA AA AA AC AA AA=9;AC=2
rs7909677 A/G chr10 AA AA AA AA AA AA AA AA AA AA AA AA=11
$ cat file
rs12255619 A/C chr10 AA AA AC AA AA AA AA AA AA AC AA
rs7909677 A/G chr10 AA AA AA AA AA AA AA AA AA AA CC
$ awk -f tst.awk file
rs12255619 A/C chr10 AA AA AC AA AA AA AA AA AA AC AA AA=9;AC=2
rs7909677 A/G chr10 AA AA AA AA AA AA AA AA AA AA CC AA=10;CC=1
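One caveat about the script above: awk does not specify the order in which `for (str in cnt)` visits keys, so the `AA=9;AC=2` pairs may come out in a different order depending on the awk implementation. Below is a minimal sketch of one way to get a stable order, assuming the substrings should be listed in the order they first appear in the row. The function name `count_genotypes` is hypothetical, and `split("", cnt)` is used in place of `delete cnt` only as a more widely portable way to empty the array:

```shell
# Hypothetical wrapper; assumes tab-separated input with the
# space-separated substrings in column 2.
count_genotypes() {
    awk '
    BEGIN { FS = OFS = "\t" }
    {
        split("", cnt)                    # empty the per-row counts
        n = split($2, tmp, / /)           # substrings are tmp[1]..tmp[n]
        for (i = 1; i <= n; i++)
            cnt[tmp[i]]++                 # tally each substring

        printf "%s", $0                   # original row, no newline yet
        sep = OFS
        for (i = 1; i <= n; i++) {        # walk in input order
            if (tmp[i] in cnt) {          # not printed yet for this row
                printf "%s%s=%d", sep, tmp[i], cnt[tmp[i]]
                sep = ";"
                delete cnt[tmp[i]]        # make sure it is printed only once
            }
        }
        print ""                          # end the output line
    }' "$@"
}
```

It can be called as `count_genotypes file > output`, or fed from a pipe. Walking `tmp` by numeric index rather than with `for (i in tmp)` is what keeps the order predictable.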