Recently, I have found Shannon's book and saw an interesting chapter: http://math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf
What I would like to do is to create a function in python such that given a text and given an order, create an other text of length N with k-order approximation of the stochastic properties of a text:
def create_text(text, N, n)
However, I do not know the algorithm to implement such a function. I suppose that I should count the frequences of the letters, but how do I correlate some letters with others according to a given order?
Counting frequences of one letter it is easy. I started counting frequencies of two letters (I think they are called bigrams). Well, I implemented a recursive function, which seems to work with small tetxs, but with bigger ones it just exceeds the recursion depth. When I tried to change the limit of the recursion depth, then it just hanged. However, with small texts it almost works:
# Input: text, letter, (recursive argument), dictionary for bigrams
def find_next (txt, x, index_txt, dic):
total = txt.count(x)
current = txt.find(x, index_txt)
if current == -1: return
print (current, txt[current+1])
# Set dictionary
keys = dic.keys()
index_dic = x+txt[current+1]
if index_dic in keys:
dic[index_dic] += 1/total
else:
dic[index_dic] = 1/total
find_next(txt, x, current+1, dic)
This function counts the frequency of appearance of letter x+another one. However, when the text ends in x, the function fails becuase of the out of range. I see that it is in line txt[current+1] the error, but I cannot figure out how to loop back. I mean, when I reach the last letter, go back to the first and end.
Example:
f = "this is amazing"
If I lauch find_next (f, 'g', 0, {}) the function fails at the end, because I don't know how to connect g to the first letter t.
I suppose this can be solved. But then, when I calculate the frequencies of unique letters (a, b, c, ...) and then bigrams (ae, ar, at, bg, bj, ...), what next..? Trigrams and so on? I cannot figure out the algorithm.
Thanks in advance
The typical approach to this problem is as follows:
Create a n-dimensional array for counting the n-grams of an existing text (no need for recursion here, this is a simple iteration), where the nth dimension corresponds to the nth letter of the n-gram. Note, that here typically non-letters should be mapped to the word-separator to reduce special cases.