bash, unix, awk, substr

Count occurrences of each distinct substring in one column of a file using UNIX tools


I would like to count how many times each distinct substring appears in the set of strings in the 2nd column of a tab-separated file. To do this, I split the column into its substrings and then try to count them. However, it does not work correctly.

The input looks like this:

rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA
rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA AA

The desired output is:

rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA   AA=9;AC=2
rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA CC   AA=10;CC=1

and so on....

awk 'BEGIN {FS=OFS="\t"} {gf=split($2,gfp," ")} {for (i=1;i<=gf;i++){
                                      if (gfp[i]=="AA"){i++; printf $1FS$2FS"%s\n" i, gfp[i]}
                                      else if (gfp[i]=="AC" || gfp[i] == "CA"){i++; printf $1FS$2FS"%s"gfp[i]"="i";\n"}
                                                            }}' input > output

I also tried another script, but I think it repeats each count as many times as the substring occurs in the row. Here I perform a second split inside the first one to distinguish between the substrings:

awk 'BEGIN {FS=OFS="\t"} {gf=split($2,gfp," ");} {for (i=1;i<=gf;i++){

                     par=gfp[i];
                     gfeach=split($2,gfpeach,par);
                     print par "=" gfeach[i]";"
                                              }
                      }' input > output

I'm sure there are easier ways to do this, but I cannot solve it completely. Is it possible to do this in a UNIX environment? Thanks in advance.


Solution

  • Your input doesn't match your output, so we're all just guessing, but this might be what you want:

    $ cat tst.awk
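    BEGIN { FS=OFS="\t" }          # input and output are tab-separated
    {
        delete cnt                 # reset the counts for each input line
        split($2,tmp,/ /)          # break column 2 into space-separated substrings
        for (i in tmp) {
            str = tmp[i]
            cnt[str]++             # tally each distinct substring
        }
    
        printf "%s", $0            # echo the original line, without a newline yet
        sep = OFS                  # the first count is preceded by a tab
        for (str in cnt) {         # note: "for (x in array)" order is unspecified in awk
            printf "%s%s=%d", sep, str, cnt[str]
            sep = ";"              # subsequent counts are separated by ";"
        }
        print ""                   # terminate the output line
    }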
    

    Depending on what your input really is, the above will output the following:

    $ cat file
    rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA
    rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA AA
    
    $ awk -f tst.awk file
    rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA        AA=9;AC=2
    rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA AA        AA=11
    
    $ cat file
    rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA
    rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA CC
    
    $ awk -f tst.awk file
    rs12255619 A/C chr10    AA AA AC AA AA AA AA AA AA AC AA        AA=9;AC=2
    rs7909677 A/G chr10     AA AA AA AA AA AA AA AA AA AA CC        AA=10;CC=1
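
One caveat about the script above: awk does not specify the order in which `for (str in cnt)` visits array keys, so the counts could come out as `AC=2;AA=9` in some awk implementations. If a stable order matters, a possible variant (a sketch only, not part of the original answer) records each substring the first time it is seen on a line and prints the counts in that order. The file name `sample.tsv` and the function name `count_genotypes` are hypothetical, chosen for illustration; the sketch assumes the same two tab-separated columns as above.

```shell
# Hypothetical sample data: column 1 (ID info) and column 2 (genotypes),
# separated by a single tab.
printf 'rs12255619 A/C chr10\tAA AA AC AA AA AA AA AA AA AC AA\n' >  sample.tsv
printf 'rs7909677 A/G chr10\tAA AA AA AA AA AA AA AA AA AA CC\n'  >> sample.tsv

# count_genotypes FILE: append "substring=count" pairs for column 2,
# listed in order of first appearance on each line.
count_genotypes() {
    awk '
    BEGIN { FS = OFS = "\t" }
    {
        delete cnt                       # reset counts for this line
        n = split($2, tmp, " ")          # " " splits on runs of whitespace
        nkeys = 0                        # distinct substrings seen on this line
        for (i = 1; i <= n; i++) {       # walk the substrings left to right
            g = tmp[i]
            if (!(g in cnt))             # first time we see this substring
                order[++nkeys] = g
            cnt[g]++
        }
        out = ""
        for (k = 1; k <= nkeys; k++)     # emit in first-seen order
            out = out (k > 1 ? ";" : "") order[k] "=" cnt[order[k]]
        print $0, out                    # original line, tab, then the counts
    }' "$1"
}

count_genotypes sample.tsv
```

With the sample data above, the first line ends in `AA=9;AC=2` and the second in `AA=10;CC=1`, regardless of which awk is used.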