
Selecting a word based on transition matrix weights in python


I am trying to select a likely next word given the current word, using the frequencies of previous word-pair occurrences as "weights". I am having trouble using np.random.choice() to actually pick the next word.

import pandas as pd
import numpy as np

texty = ("won't you celebrate with me what i have shaped into a kind of life i had no model born in babylon both nonwhite and woman what did i see to be except myself i made it up here on this bridge between starshine and clay my one hand holding tight my other hand come celebrate with me that everyday "
         "something has tried to kill me and has failed.")

# https://www.poetryfoundation.org/poems/50974/wont-you-celebrate-with-me

words = texty.split()

# Creating the text-based transition matrix

x = pd.crosstab(pd.Series(words[1:],name='next'),
            pd.Series(words[:-1],name='word'),normalize=1)

print(x)

# Selecting the next word based on the current word.
# https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.choice.html

current = "and"

# this part isn't working--->
next = np.random.choice(current,1,current) # was "y"

I don't know how to refer to the transition matrix from here. I would like this choice to be based on the probabilities of previous occurrences. For example, the probability of "clay" following "and" is 33%.


Solution

  • x is a Pandas DataFrame.

    You can access any of the columns of that DataFrame as if the column names were keys into a dictionary.

    > print(x['won\'t'])
    next
    a            0.0
    and          0.0
    babylon      0.0
    ...
    with         0.0
    woman        0.0
    you          1.0
    Name: won't, dtype: float64
    

    Selecting a column returns a Pandas Series. For a column of your transition matrix x, the index of that Series holds the candidate next words, and its values hold their probabilities given the current word. Pass the index as the choices and the values as p to np.random.choice to draw the next word, weighted by your transition matrix.

    > current_word = 'won\'t'
    > current_column = x[current_word]
    > next_word = np.random.choice(current_column.index,
                     p=current_column.values)
    > print(next_word)
    you
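
The same lookup can be repeated to walk through several words in a row. The following is only a minimal sketch, not part of the original answer: the helper name generate_words, its parameters, and the shortened sample text are assumptions made for illustration. It also guards against a case the single-step version does not hit: the final word of the text ("failed.") never appears as a predecessor, so it has no column in x.

```python
import numpy as np
import pandas as pd

# Shortened excerpt of the question's text (an assumption, to keep the sketch small).
texty = ("come celebrate with me that everyday "
         "something has tried to kill me and has failed.")
words = texty.split()

# Same construction as in the question: each column holds P(next | word).
x = pd.crosstab(pd.Series(words[1:], name='next'),
                pd.Series(words[:-1], name='word'),
                normalize='columns')


def generate_words(matrix, start, length, rng=None):
    """Hypothetical helper: sample up to `length` words, beginning with `start`."""
    rng = np.random.default_rng() if rng is None else rng
    result = [start]
    current = start
    for _ in range(length - 1):
        # A word that was never followed by anything has no column: stop there.
        if current not in matrix.columns:
            break
        column = matrix[current]
        current = str(rng.choice(column.index.to_numpy(), p=column.to_numpy()))
        result.append(current)
    return result


print(generate_words(x, "come", 8))
```

Passing a seeded generator, for example rng=np.random.default_rng(0), makes the walk reproducible. Stopping at a word without a column is one possible design choice; restarting from a randomly chosen word would be another.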