I have a text file:
('1', 6310445) [12, 20]_S:0.6:0:ACAAAAAAAAAAA_i_V
('1', 17704109) [12, 31]_S:0.387:0:CCCCCCCCCCCC_i_V
('1', 18922274) [8, 22]_S:0.364:0:AAAAAAAA_i_V
('1', 22750694) [8, 19]_S:0.421:0:TTTTTTTT_i_V
('1', 25564545) [9, 23]_S:0.391:0:AAAAAAAAA_i_V
('1', 29189562) [13, 34]_S:0.382:0:AAAAAAAAAAAAA_i_V
('1', 30166561) [14, 20]_S:0.7:0:TTTTTTTTTTTTTT_i_V
('1', 30450439) [9, 14]_S:0.643:0:AAAAAAAAA_i_V
('1', 30981321) [12, 23]_S:0.522:0:AAAAAAAAAAAA_i_V
I want to print the most frequently occurring letter between the last ":" and the first "_" that follows it.
That is:
"ACAAAAAAAAAAA" => A, "CCCCCCCCCCCC" => C, ...
The output should be:
A C A T A A T A A
How can I do this?
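You can use a simple reduce-style approach: split each line on ":", take the part of the last field before the first "_", and count its characters while tracking the current maximum: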
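awk -F: -v ORS= '
  # act only on lines with a ":" whose last field contains a "_"
  NF>1 && split($NF,a,/_/)>1 {
    # walk the sequence a[1] from its last character to its first
    for (i=length(s=a[1]); i>0; i--)
      if (++n[c=substr(s,i,1)] > n[r])
        r=c                # c is now the most frequent letter so far
    print r OFS
    split(r="",n)          # reset state
  }
  END { print "\n" }
' textfile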
If several characters tie for most frequent (e.g. ABCABCABC), the first one to reach the maximum count is printed. Because the loop walks the sequence from right to left, that is C for this example.
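If you would rather have ties go to the letter that leads when reading left to right, one possible variant is sketched below. It assumes a POSIX shell and awk. The function name most_frequent and the sample record piped into it are made up for illustration. It also puts the separator before each item instead of after it, so the output line has no trailing space.

```shell
# Hypothetical helper (not from the original answer): prints the most
# frequent letter of each record, space-separated on a single line.
most_frequent() {
    awk -F: '
    NF > 1 && split($NF, a, /_/) > 1 {
        s = a[1]
        best = ""
        # Scan left to right: on a tie, the letter whose count reaches
        # the maximum first in reading order is kept.
        for (i = 1; i <= length(s); i++) {
            c = substr(s, i, 1)
            if (++n[c] > n[best])
                best = c
        }
        printf "%s%s", sep, best   # separator goes before every item but the first
        sep = " "
        split("", n)               # clear the counts for the next record
    }
    END { print "" }               # final newline
    ' "$@"
}

# Made-up record with a three-way tie; this prints A
printf '%s\n' "('1', 0) [9, 9]_S:0.5:0:ABCABCABC_i_V" | most_frequent
```

Called as `most_frequent textfile`, it reads the file instead of standard input.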