
get next word from bigram model on max probability


I want to generate sonnets using nltk with bigrams. I have generated the bigrams, computed the probability of each bigram, and stored the results in a defaultdict like this:

[('"Let', defaultdict(<function <lambda>.<locals>.<lambda> at 0x1a17f98bf8>, 
{'the': 0.2857142857142857, 'dainty': 
0.14285714285714285, 'it': 0.14285714285714285, 'those': 
0.14285714285714285, 'me': 0.14285714285714285, 'us': 
0.14285714285714285}))]

The probability of each word appearing after "Let" is given, and I have bigram entries like this for my whole corpus. Now I want to generate a 4-line sonnet with 15 words in each line. I have tried this code, but it is not working.

def generate_sonnet(word):
    lines = 4
    words = 15
    for i in range(lines):
        line = ()
        for j in range(words):
            # I am selecting the max probability, but not that word.
            # How can I select the word which has the max probability of occurring after `word`?
            nword = float(max(model[word].values()))
            word += nword

word = random.choice(poetrylist)
generate_sonnet(word)

I select a random word and pass it to my function, where I want to join 15 words using bigrams; when one line completes, the next three should follow the same way.


Solution

  • Here is a simple code snippet that shows how this task can be achieved (with a very naive approach):

    bigram1 = {'Let' : {'the': 0.2857142857142857, 'dainty':
    0.14285714285714285, 'it': 0.14285714285714285, 'those':
    0.14285714285714285, 'me': 0.14285714285714285, 'us':
    0.14285714285714285}}
    
    bigram2 = {'the' : {'dogs' : 0.4, 'it' : 0.2, 'a' : 0.2, 'b': 0.2}}
    bigram3 = {'dogs' : {'out' : 0.6, 'it' : 0.2, 'jj' : 0.2}}
    
    model = {}
    model.update(bigram1)
    model.update(bigram2)
    model.update(bigram3)
    
    sentence = []
    
    iterations = 3
    word = 'Let'
    sentence.append(word)
    
    for _ in range(iterations):
        max_value = 0
        for k, v in model[word].items():
            if v >= max_value:
                word = k
                max_value = v
        sentence.append(word)
    
    
    print(" ".join(sentence)) 
    

    output

    Let the dogs out
    

    The code is written in a very simple way; it is a toy example for understanding purposes.

    Keep in mind that the word taken is the first word encountered with the max value, so this model is deterministic. Consider adding a random approach that chooses among the set of words which share the same max value.
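A minimal sketch of that tie-breaking idea (the toy `model` below is my own illustration, not the asker's data): collect every successor tied for the maximum probability, then pick one of them at random instead of taking the first key seen.

```python
import random

# Toy bigram model, same shape as above (assumed values for illustration)
model = {
    'Let': {'the': 0.4, 'us': 0.4, 'me': 0.2},
}

word = 'Let'
dist = model[word]

# All successors tied for the maximum probability
max_p = max(dist.values())
candidates = [w for w, p in dist.items() if p == max_p]

# Break the tie randomly instead of always taking the first match
nword = random.choice(candidates)
print(nword)  # 'the' or 'us', chosen at random
```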

    I suggest sampling the words in proportion to their probabilities, like so:

    dist = {'the': 0.2857142857142857, 'dainty':
    0.14285714285714285, 'it': 0.14285714285714285, 'those':
    0.14285714285714285, 'me': 0.14285714285714285, 'us':
    0.14285714285714285}
    
    import numpy
    
    words = list(dist.keys())
    probabilities = list(dist.values())
    numpy.random.choice(words, p=probabilities)
    

    This will give you a "random" word every time, drawn according to the given distribution.

    Something like this (draft):

    for _ in range(iterations):
        word = np.random.choice(list(model[word].keys()), p=list(model[word].values()))
        sentence.append(word)
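Putting the sampling idea together with the original 4-line, 15-words-per-line requirement, one possible sketch looks like this. The function shape, the dead-end fallback, and the toy model are my assumptions, not the asker's code; a real model built from a corpus would plug in the same way.

```python
import random
import numpy as np

def generate_sonnet(model, start_word, lines=4, words_per_line=15):
    """Generate lines of text by sampling successors from a bigram model."""
    word = start_word
    sonnet = []
    for _ in range(lines):
        line = [word]
        for _ in range(words_per_line - 1):
            dist = model.get(word)
            if not dist:
                # Dead end (word never seen as a first token): restart randomly
                word = random.choice(list(model))
                dist = model[word]
            # Sample the next word in proportion to its bigram probability
            word = np.random.choice(list(dist.keys()), p=list(dist.values()))
            line.append(word)
        sonnet.append(" ".join(line))
        word = random.choice(list(model))  # fresh start word for the next line
    return "\n".join(sonnet)

# Toy model for illustration (assumed probabilities)
toy_model = {
    'Let': {'the': 0.5, 'us': 0.5},
    'the': {'dogs': 1.0},
    'dogs': {'out': 1.0},
    'out': {'Let': 1.0},
    'us': {'the': 1.0},
}
print(generate_sonnet(toy_model, 'Let'))
```

Each probability dict passed to `np.random.choice` must sum to 1, which holds here because the counts were normalized per preceding word.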