
Awk: character frequency from a text file?


Given a multilingual .txt file such as:

But where is Esope the holly Bastard
But where is 생 지 옥 이 군
지 옥 이
지 옥
지
我 是 你 的 爸 爸 !
爸 爸 ! ! !
你 不 會 的 !

I counted the frequency of space-separated words using this Awk one-liner:

$ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" myfile.txt | sort

This gives the elegant output:

1 생
1 군
1 Bastard
1 Esope
1 holly
1 the
1 不
1 我
1 是
1 會
2 이
2 But
2 is
2 where
2 你
2 的
3 옥
4 지
4 爸
5 !
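
For reference, the one-liner above relies on a regular-expression record separator (RS=" |\n"), which gawk and mawk accept but POSIX awk does not guarantee. A minimal sketch of the same word count that loops over the fields of each line instead, assuming any POSIX awk (word_freq is a hypothetical helper name, not part of the original command):

```shell
# word_freq: hypothetical helper, reads text on standard input and prints
# "count word" pairs for every whitespace-separated word.
word_freq() {
    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
         END { for (w in count) print count[w], w }'
}

# Small inline sample; sort -rn orders by count, most frequent first.
printf 'But where is Esope\nBut where is it\n' | word_freq | sort -rn
```

To run it on the file from the question: word_freq < myfile.txt | sort -rn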

How can I change it to count character frequency instead?


EDIT: For character frequency, I used the following (from @Sudo_O's answer):

$ grep -o '\S' myfile.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort > myoutput.txt

For word-frequency, use:

$ grep -o '\w*' myfile.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort > myoutput.txt

Solution

  • One method:

    $ grep -o '\S' file | awk '{a[$1]++}END{for(k in a)print a[k],k}' 
    3 옥
    4 h
    2 u
    2 i
    3 B
    5 !
    2 w
    4 爸
    1 군
    4 지
    1 y
    2 l
    1 E
    1 會
    2 你
    1 是
    2 a
    1 不
    2 이
    2 o
    1 p
    2 的
    1 d
    1 생
    3 r
    6 e
    4 s
    1 我
    4 t
    

    Use redirection to save the output to a file:

    $ grep -o '\S' file | awk '{a[$1]++}END{for(k in a)print a[k],k}' > output
    

    And for sorted output:

    $ grep -o '\S' file | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort > output
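
A plain sort compares lines as text, so a count of 10 would be listed before a count of 2 on larger inputs. The pipeline above can be wrapped with a numeric sort as a minimal sketch, under two assumptions: char_freq is a hypothetical helper name, and the shell runs in a UTF-8 locale so that grep treats each CJK character as one character rather than as separate bytes. The bracket expression [^[:space:]] is used as the POSIX spelling of GNU grep's \S.

```shell
# char_freq: hypothetical helper, reads text on standard input and prints
# "count character" pairs, most frequent first.
char_freq() {
    # grep -o prints every non-whitespace character on its own line,
    # awk tallies the lines, and sort -rn orders the tallies numerically.
    grep -o '[^[:space:]]' |
        awk '{ count[$1]++ } END { for (c in count) print count[c], c }' |
        sort -rn
}

# Small inline sample.
printf 'hello world\n' | char_freq
```

To save the result for a file: char_freq < file > output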