Search code examples
pythonpython-3.xstatisticsfrequencyinformation-theory

INFORMATION THEORY: trying to create new texts using stochastic properties of other texts PYTHON


Recently, I have found Shannon's book and saw an interesting chapter: http://math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf

enter image description here

What I would like to do is to create a function in python such that given a text and given an order, create an other text of length N with k-order approximation of the stochastic properties of a text:

def create_text(text, N, n)

However, I do not know the algorithm to implement such a function. I suppose that I should count the frequences of the letters, but how do I correlate some letters with others according to a given order?

MY ATTEMPT:

Counting frequences of one letter it is easy. I started counting frequencies of two letters (I think they are called bigrams). Well, I implemented a recursive function, which seems to work with small tetxs, but with bigger ones it just exceeds the recursion depth. When I tried to change the limit of the recursion depth, then it just hanged. However, with small texts it almost works:

# Input: text, letter, (recursive argument), dictionary for bigrams
def find_next (txt, x, index_txt, dic):
    total = txt.count(x)
    current = txt.find(x, index_txt)
    if current == -1: return
    print (current, txt[current+1])
    # Set dictionary
    keys = dic.keys()
    index_dic = x+txt[current+1]
    if index_dic in keys:
        dic[index_dic] += 1/total
    else:
        dic[index_dic] = 1/total
    find_next(txt, x, current+1, dic)

This function counts the frequency of appearance of letter x+another one. However, when the text ends in x, the function fails becuase of the out of range. I see that it is in line txt[current+1] the error, but I cannot figure out how to loop back. I mean, when I reach the last letter, go back to the first and end.

Example:

f = "this is amazing"

If I lauch find_next (f, 'g', 0, {}) the function fails at the end, because I don't know how to connect g to the first letter t.

I suppose this can be solved. But then, when I calculate the frequencies of unique letters (a, b, c, ...) and then bigrams (ae, ar, at, bg, bj, ...), what next..? Trigrams and so on? I cannot figure out the algorithm.

Thanks in advance


Solution

  • The typical approach to this problem is as follows:

    Analysis part

    Create a n-dimensional array for counting the n-grams of an existing text (no need for recursion here, this is a simple iteration), where the nth dimension corresponds to the nth letter of the n-gram. Note, that here typically non-letters should be mapped to the word-separator to reduce special cases.

    Generation part

    • For each new letter to be generated, create a simple probability table for the next character given the last n-1 letters, which were already generated.
    • Generate a random number to choose one of the possible next characters.