python pandas numpy matrix frequency

Calculating the frequency of each word in the transition matrix, using numpy and pandas only


I am trying to convert the counts in a transition matrix into per-row frequencies (probabilities), using numpy and pandas only.

I have a list of word pairs:

star_wars = [('darth', 'leia'), ('luke', 'han'), ('chewbacca', 'luke'), 
         ('chewbacca', 'obi'), ('chewbacca', 'luke'), ('leia', 'luke')]

I built a count matrix from this list, following this question:

             chewbacca  darth  han  leia  luke  obi
chewbacca          0      0    0     0     2    1
darth              0      0    0     1     0    0
han                0      0    0     0     1    0
leia               0      0    0     0     1    0
luke               0      0    0     0     0    0
obi                0      0    0     0     0    0
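
For readers who want to reproduce a count matrix like this, here is a minimal sketch using pd.crosstab. It assumes the first word of each pair is the row and the second word is the column; the column names src and dst and the variable words are made up for illustration. Under that assumption the pair ('luke', 'han') lands in row luke, column han.

```python
import pandas as pd

star_wars = [('darth', 'leia'), ('luke', 'han'), ('chewbacca', 'luke'),
             ('chewbacca', 'obi'), ('chewbacca', 'luke'), ('leia', 'luke')]

# 'src' and 'dst' are hypothetical column names, not from the original post
df = pd.DataFrame(star_wars, columns=['src', 'dst'])

# Every word that occurs anywhere, in alphabetical order
words = sorted(set(df['src']) | set(df['dst']))

# Count each (src, dst) pair, then make the table square:
# words that never appear as a source (or target) get an all-zero row (or column)
counts = (pd.crosstab(df['src'], df['dst'])
            .reindex(index=words, columns=words, fill_value=0))

print(counts)
```

The reindex step is what turns the 4x4 crosstab into a 6x6 matrix that has every word on both axes.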

Now I am trying to convert these counts into probabilities, following this question.

Using a crosstab works for the initial dataframe, but it gives me pairs only:

pd.crosstab(pd.Series(star_wars[1:]),
        pd.Series(star_wars[:-1]), normalize = 1)

The output is wrong, and this approach does not work for the matrix I created either. Just as an example:

col_0   (chewbacca, luke)   (chewbacca, obi)    (darth, leia)   (luke, han)
row_0               
(chewbacca, luke)   0.0 1.0 0.0 1.0
(chewbacca, obi)    0.5 0.0 0.0 0.0
(leia, luke)        0.5 0.0 0.0 0.0
(luke, han)         0.0 0.0 1.0 0.0

I also wrote a function:

from itertools import islice

def my_function(seq, n = 2):
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

Then I apply the function and calculate the probabilities:

pairs = pd.DataFrame(my_function(star_wars), columns=['Columns', 'Rows'])
counts = pairs.groupby('Columns')['Rows'].value_counts()
probs = (counts/counts.sum()).unstack()

print(probs)

But this gives me values for pairs of tuples rather than for single words (and I am not even sure they are correct):

Rows               (chewbacca, luke)  (chewbacca, obi)  (leia, luke)  \
Columns                                                                
(chewbacca, luke)                NaN               0.2           0.2   
(chewbacca, obi)                 0.2               NaN           NaN   
(darth, leia)                    NaN               NaN           NaN   
(luke, han)                      0.2               NaN           NaN   

Rows               (luke, han)  
Columns                         
(chewbacca, luke)          NaN  
(chewbacca, obi)           NaN  
(darth, leia)              0.2  
(luke, han)                NaN  
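
A possible reason for the output above: my_function slides a window over the list of tuples, so each row of pairs holds two whole tuples, and counts/counts.sum() divides by the grand total instead of by each group's total. A minimal sketch of the same groupby idea applied to the words directly, assuming the first word of each tuple is the row and the second is the column (the column names first and second are hypothetical):

```python
import pandas as pd

star_wars = [('darth', 'leia'), ('luke', 'han'), ('chewbacca', 'luke'),
             ('chewbacca', 'obi'), ('chewbacca', 'luke'), ('leia', 'luke')]

# Each tuple is already a (first, second) pair, so no sliding window is needed
pairs = pd.DataFrame(star_wars, columns=['first', 'second'])

# normalize=True divides within each group, so every row sums to 1
probs = (pairs.groupby('first')['second']
              .value_counts(normalize=True)
              .unstack(fill_value=0))

print(probs)
```

This only contains rows for words that occur first in some pair and columns for words that occur second, so it would still need a reindex to become a square matrix.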

Another attempt, just using crosstab

Desired output: a matrix with probabilities, not counts.

For example

            chewbacca  darth  han  leia  luke  obi
chewbacca          0      0    0     0   0.66 0.33
darth              0      0    0     1     0    0
han                0      0    0     0     1    0
leia               0      0    0     0     1    0
luke               0      0    0     0     0    0
obi                0      0    0     0     0    0

Appreciate your time and help!


Solution

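  • This can still be done with crosstab: normalize='index' makes each row sum to 1, and the reindex calls add the words that are missing from the rows or columns, filled with 0.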

    df=pd.DataFrame(star_wars)
    s=pd.crosstab(df[0],df[1],normalize='index')
    s=s.reindex(index=df.stack().unique(),fill_value=0).reindex(columns=df.stack().unique(),fill_value=0)
    s
    1          darth  leia      luke  han  chewbacca       obi
    0                                                         
    darth          0   1.0  0.000000  0.0          0  0.000000
    leia           0   0.0  1.000000  0.0          0  0.000000
    luke           0   0.0  0.000000  1.0          0  0.000000
    han            0   0.0  0.000000  0.0          0  0.000000
    chewbacca      0   0.0  0.666667  0.0          0  0.333333
    obi            0   0.0  0.000000  0.0          0  0.000000
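
If the count matrix already exists, another option is to normalize it row by row with numpy. The following is only a sketch under assumptions: the array arr is typed in by hand from the star_wars pairs (first word as row, second as column, alphabetical order), and all-zero rows are left as zeros instead of producing NaN.

```python
import numpy as np
import pandas as pd

words = ['chewbacca', 'darth', 'han', 'leia', 'luke', 'obi']

# Hand-entered counts for the star_wars pairs (rows: first word, columns: second word)
arr = np.array([
    [0, 0, 0, 0, 2, 1],   # chewbacca -> luke (x2), obi
    [0, 0, 0, 1, 0, 0],   # darth -> leia
    [0, 0, 0, 0, 0, 0],   # han never appears first
    [0, 0, 0, 0, 1, 0],   # leia -> luke
    [0, 0, 1, 0, 0, 0],   # luke -> han
    [0, 0, 0, 0, 0, 0],   # obi never appears first
])

# Row totals, kept as a column vector so they broadcast across each row
row_sums = arr.sum(axis=1, keepdims=True)

# Divide only where the row total is non-zero; other rows keep the zeros from `out`
probs = np.divide(arr, row_sums,
                  out=np.zeros(arr.shape, dtype=float),
                  where=row_sums != 0)

probs_df = pd.DataFrame(probs, index=words, columns=words)
print(probs_df.round(2))
```

With a DataFrame of counts, counts.div(counts.sum(axis=1), axis=0).fillna(0) should give the same result in pandas alone.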