linux awk separator

Print the most frequently occurring letter in a string using AWK


I have a text file:

('1', 6310445)  [12, 20]_S:0.6:0:ACAAAAAAAAAAA_i_V
('1', 17704109) [12, 31]_S:0.387:0:CCCCCCCCCCCC_i_V
('1', 18922274) [8, 22]_S:0.364:0:AAAAAAAA_i_V
('1', 22750694) [8, 19]_S:0.421:0:TTTTTTTT_i_V
('1', 25564545) [9, 23]_S:0.391:0:AAAAAAAAA_i_V
('1', 29189562) [13, 34]_S:0.382:0:AAAAAAAAAAAAA_i_V
('1', 30166561) [14, 20]_S:0.7:0:TTTTTTTTTTTTTT_i_V
('1', 30450439) [9, 14]_S:0.643:0:AAAAAAAAA_i_V
('1', 30981321) [12, 23]_S:0.522:0:AAAAAAAAAAAA_i_V

I want to print the most frequently occurring letter in the substring between the last ":" and the first "_" that follows it.

For example:

"ACAAAAAAAAAAA" => A, "CCCCCCCCCCCC" => C, ...

The expected output is

A C A T A A T A A

How can I do this?


Solution

  • You can use a simple reduce-style approach:

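    awk -F: -v ORS= '
        # $NF is the text after the last ":"; a[1] is the part before its first "_"
        NF>1 && split($NF,a,/_/)>1 {
            # scan right to left, counting letters; r tracks the current leader
            for (i=length(s=a[1]); i>0; i--)
                if (++n[c=substr(s,i,1)] > n[r])
                    r=c
            print r OFS

            split(r="",n) # reset r and clear the counts in n
        }
        END { print "\n" }
    ' textfile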
    

    If several characters tie for the highest count (e.g. ABCABCABC), the one that reaches that count first is printed. Because the loop scans the string from right to left, that is C in this example.
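If ties should instead be decided by a left-to-right scan, a variant along the following lines may work. This is only a sketch: the function name `most_common_letter` and the variable `best` are made up for illustration and are not part of the answer above, and it prints one letter per line rather than setting ORS.

```shell
# most_common_letter: hypothetical helper name, not from the original answer.
# Reads lines on stdin. For each line, takes the text between the last ":"
# and the next "_" and prints its most frequent letter, one result per line.
most_common_letter() {
    awk -F: '
        NF>1 && split($NF,a,/_/)>1 {
            s = a[1]; best = ""
            # scan left to right; on a tie, the letter that reaches the
            # highest count first in this scan is kept
            for (i = 1; i <= length(s); i++) {
                c = substr(s, i, 1)
                if (++n[c] > n[best]) best = c
            }
            print best
            split("", n)   # clear the counts before the next line
        }
    '
}

# Usage: join the per-line results with spaces to get a single output line.
printf '%s\n' "('1', 6310445)  [12, 20]_S:0.6:0:ACAAAAAAAAAAA_i_V" \
              "('1', 17704109) [12, 31]_S:0.387:0:CCCCCCCCCCCC_i_V" |
    most_common_letter | paste -sd' ' -
```

With this scan direction, ABCABCABC yields A rather than C. Lines without a ":" or without a "_" after the last ":" are skipped, as in the original command.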