English texts lexicon comparison

Suppose we build a table of statistics showing how often each word is used in some English text or book. We can gather these statistics for each text or book in a library. What is the simplest way to compare these statistics with each other? How can we find a group or cluster of texts with statistically very similar lexicons?

Solution

First, you'd need to normalize the lexicons (i.e., ensure that both lexicons are expressed over the same vocabulary).
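As a rough illustration of this normalization step, here is a minimal Python sketch. The function name `to_aligned_frequencies` is hypothetical, the whitespace tokenization is a simplifying assumption, and converting raw counts to relative frequencies is my own addition so that texts of different lengths become comparable. Both texts are assumed to be non-empty.

```python
from collections import Counter

def to_aligned_frequencies(text_a, text_b):
    """Return a shared vocabulary and two relative-frequency vectors over it."""
    # Naive tokenization (assumption): lowercase, split on whitespace.
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # The shared vocabulary is the union of the words seen in either text.
    vocab = sorted(set(counts_a) | set(counts_b))

    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())

    # A Counter returns 0 for missing words, so both vectors have the same length.
    freq_a = [counts_a[word] / total_a for word in vocab]
    freq_b = [counts_b[word] / total_b for word in vocab]
    return vocab, freq_a, freq_b

vocab, freq_a, freq_b = to_aligned_frequencies("the cat sat", "the dog")
```

Each position in `freq_a` and `freq_b` refers to the same word in `vocab`, which is what a vector-based similarity metric needs.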

Then you could use a similarity metric such as the Hellinger distance or the cosine similarity to compare the two lexicons.
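Both metrics can be sketched in a few lines of Python. This is only an illustration, assuming each lexicon has already been turned into a list of relative word frequencies over the same vocabulary (so each list sums to 1 and is not all zeros). The function names are my own.

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between two frequency vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    squared_diffs = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared_diffs) / math.sqrt(2)

# Hypothetical frequency vectors over a three-word vocabulary.
text_1 = [0.5, 0.5, 0.0]
text_2 = [0.4, 0.4, 0.2]
similarity = cosine_similarity(text_1, text_2)
distance = hellinger_distance(text_1, text_2)
```

Note the opposite orientation of the two: a higher cosine similarity means more similar lexicons, while a lower Hellinger distance does. Computing one of them for every pair of texts gives a matrix that a clustering method can then work on.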

It may also be a good idea to look into machine learning packages such as Weka.

This book is an excellent source for machine learning and you may find it useful.
