algorithm, statistics, lexicon

English texts lexicon comparison


Suppose we build a statistics table recording how often each word is used in some English text or book, and we gather such a table for every text/book in a library. What is the simplest way to compare these tables with each other? How can we find a group/cluster of texts with statistically very similar lexicons?


Solution

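  • First, you'd need to normalize the lexicons (i.e., ensure that both lexicons are expressed over the same vocabulary, with a count of zero for any word a text does not use). Converting raw counts to relative frequencies also keeps a long book from dominating a short one.

    Then you could use a similarity metric such as the Hellinger distance or the cosine similarity to compare the two lexicons.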

    It may also be a good idea to look into machine learning packages such as Weka.

    This book is an excellent source for machine learning and you may find it useful.
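
As a rough illustration of the normalization step and the two metrics mentioned above, here is a minimal Python sketch using only the standard library. The function names (word_counts, to_distribution, compare) and the whitespace tokenizer are assumptions made for this example, not part of any particular package; real texts would need proper tokenization (punctuation, stemming, stop words).

```python
import math
from collections import Counter


def word_counts(text):
    """Count lowercase, whitespace-separated tokens (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


def to_distribution(counts, vocabulary):
    """Relative frequency of every vocabulary word; words absent from the text get 0."""
    total = sum(counts.values())
    if total == 0:
        return [0.0 for _ in vocabulary]
    return [counts.get(word, 0) / total for word in vocabulary]


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions: 0 = identical, 1 = disjoint."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


def cosine_similarity(p, q):
    """Cosine of the angle between two frequency vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def compare(text_a, text_b):
    """Return (Hellinger distance, cosine similarity) for two texts."""
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    # Normalization: both lexicons are expressed over the union vocabulary.
    vocabulary = sorted(set(counts_a) | set(counts_b))
    p = to_distribution(counts_a, vocabulary)
    q = to_distribution(counts_b, vocabulary)
    return hellinger_distance(p, q), cosine_similarity(p, q)
```

Calling compare on every pair of texts in the library yields a pairwise distance table, which could then be handed to a clustering tool to find groups of texts with similar lexicons.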