bigdata data-analysis pattern-recognition frequency-analysis word-frequency

How to analyze the frequency of characters in a text file


I have a text file with approximately 25 million lines. The data on the lines looks like this:

12ertwrtrdfger
897erterterte
545ret3w2trewt345
968587563453345
89753647565344553


I want to find the most frequent prefixes and suffixes. In the example above, two lines start with 897 and two lines end with 345. I want to see which prefixes and suffixes occur most often, and I would also like the results as a bar or pie chart. Is there a data analysis program that does this kind of analysis?


Solution

  • I've solved my problem with the code below:

    # Take the first 5 characters of each line, count repeated prefixes, most frequent first
    cut -c 1-5 abc.txt | sort | uniq -cd | sort -nbr > pre5.txt
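
The pipeline above only covers prefixes. The following is a minimal sketch of how the suffix side and a rough text chart might look, using the same sort and uniq approach. The helper names `top_suffixes` and `text_bars` are hypothetical and not part of the original answer. The sketch assumes the lines contain no spaces or tabs, and that the chart input is already sorted by count in descending order.

```shell
# Hypothetical helper: list the N-character suffixes that occur more than
# once in FILE, most frequent first. Usage: top_suffixes FILE [N]
top_suffixes() {
    # awk keeps the last n characters of every line that is long enough;
    # the rest mirrors the prefix pipeline (count, keep duplicates, sort).
    awk -v n="${2:-3}" 'length($0) >= n { print substr($0, length($0) - n + 1) }' "$1" |
        sort | uniq -cd | sort -nbr
}

# Hypothetical helper: turn "count value" lines (as written by uniq -c) into
# a rough text bar chart, scaled so the largest count gets 50 marks.
# Assumes the first input line carries the largest count.
text_bars() {
    awk 'NR == 1 { max = $1 }
         { bar = ""; w = int($1 * 50 / max)
           for (i = 0; i < w; i++) bar = bar "#"
           printf "%-10s %8d %s\n", $2, $1, bar }' "$1"
}
```

For example, `top_suffixes abc.txt 3 > suf3.txt` followed by `text_bars suf3.txt` would print one bar per repeated 3-character suffix. `text_bars pre5.txt` should work the same way on the prefix counts. For a real bar or pie chart, the two-column output can be loaded into a spreadsheet or plotting tool.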