
Counting occurrences of each second-column string for each first-column value in a file


I have this input text file:

CD196_RS15035       normal alleles
CD196_RS15035       normal alleles
CD196_RS15035       truncation in the allele
CD196_RS15035       truncation in the allele
CD196_RS15035       no stop for allele
CD196_RS15035       no stop for allele
CD196_RS16835       normal alleles
CD196_RS16835       truncation in the allele
CD196_RS16835       no stop for allele
CD196_RS16835       no stop for allele

For each value in the first column, I want to count how many times each string occurs in the second column.

I want an output text file like this:

CD196_RS15035  normal alleles  2    truncation in the allele   2    no stop for allele  2
 
CD196_RS16835  normal alleles  1    truncation in the allele   1    no stop for allele  2

Any tip would be helpful. Thank you.


Solution

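  • With GNU awk's arrays of arrays (true multidimensional arrays, available since gawk 4.0). The fields are split on runs of two or more spaces, so the single spaces inside the second column are kept:

    awk -F'[ ]{2,}' '
      { a[$1][$2]++ }                 # count each (column 1, column 2) pair
      END {
          for (i in a) {              # one output line per column-1 value
              printf("%s ", i)
              for (j in a[i]) printf("%s %d ", j, a[i][j])
              print ""
          }
      }' test.txt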
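
    Output (awk does not guarantee the order of a for-in loop, so the groups may appear in a different order):
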
    CD196_RS15035 normal alleles 2 no stop for allele 2 truncation in the allele 2 
    CD196_RS16835 normal alleles 1 no stop for allele 2 truncation in the allele 1
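
  • If gawk is not available, the same counting can be sketched in POSIX awk with a simulated two-dimensional key (count[key, desc]). This is only a minimal sketch, not the answer's original method. The array names (keys, descs, count, seen_key, seen_desc) and the two-space output separator are illustrative choices, and test.txt is recreated here from the sample in the question. It assumes the first column contains no whitespace, and it prints the groups in the order in which they first appear in the input:

```shell
# Recreate the sample input from the question.
cat > test.txt <<'EOF'
CD196_RS15035       normal alleles
CD196_RS15035       normal alleles
CD196_RS15035       truncation in the allele
CD196_RS15035       truncation in the allele
CD196_RS15035       no stop for allele
CD196_RS15035       no stop for allele
CD196_RS16835       normal alleles
CD196_RS16835       truncation in the allele
CD196_RS16835       no stop for allele
CD196_RS16835       no stop for allele
EOF

result=$(awk '
NF < 2 { next }                          # skip blank or key-only lines
{
    key = $1
    desc = $0
    sub(/^[^ \t]+[ \t]+/, "", desc)      # drop the key and the padding after it

    # Remember first-seen order so the output is deterministic.
    if (!(key in seen_key))   { seen_key[key] = 1;   keys[++nk] = key }
    if (!(desc in seen_desc)) { seen_desc[desc] = 1; descs[++nd] = desc }

    count[key, desc]++                   # POSIX simulated 2-D array (joined with SUBSEP)
}
END {
    for (i = 1; i <= nk; i++) {
        line = keys[i]
        for (j = 1; j <= nd; j++)
            if ((keys[i], descs[j]) in count)
                line = line "  " descs[j] "  " count[keys[i], descs[j]]
        print line
    }
}' test.txt)

printf '%s\n' "$result"
```

    Storing the keys in numbered arrays costs a little extra memory but avoids relying on for-in order, which differs between awk implementations.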