How to convert a dataframe with string in columns into csr_matrix


I am working on a PMI problem. So far I have a dataframe like this:

w = ['by', 'step', 'by', 'the', 'is', 'step', 'is', 'by', 'is']
c = ['step', 'what', 'is', 'what', 'the', 'the', 'step', 'the', 'what']
ppmi = [1, 3, 12, 3, 123, 1, 321, 1, 23]
df = pd.DataFrame({'w':w, 'c':c, 'ppmi': ppmi})

I want to convert this dataframe into a sparse matrix. Since w and c are lists of strings, calling csr_matrix((ppmi, (w, c))) raises TypeError: cannot perform reduce with flexible type, because the row and column indices must be integers. Is there another way to convert this dataframe?


Solution

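  • You could try coo_matrix, using the integer codes of a MultiIndex built from w and c as the row and column indices (call mat.tocsr() afterwards if you need CSR):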

    import pandas as pd
    import scipy.sparse as sps
    w = ['by', 'step', 'by', 'the', 'is', 'step', 'is', 'by', 'is']
    c = ['step', 'what', 'is', 'what', 'the', 'the', 'step', 'the', 'what']
    ppmi = [1, 3, 12, 3, 123, 1, 321, 1, 23]
    df = pd.DataFrame({'w':w, 'c':c, 'ppmi': ppmi})
    df.set_index(['w', 'c'], inplace=True)
    # MultiIndex.codes gives each label's integer position (it replaced .labels, removed in pandas 1.0)
    mat = sps.coo_matrix((df['ppmi'], (df.index.codes[0], df.index.codes[1])))
    print(mat.todense())
    

    Output:

    [[ 12   1   1   0]
     [  0 321 123  23]
     [  0   0   1   3]
     [  0   0   0   3]]
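
If you want a csr_matrix directly and also need to know which word each row and column stands for, one possible variant is to build the integer codes with pd.factorize. This is only a sketch under the same sample data, not part of the original answer; the names row_codes, row_labels, col_codes and col_labels are made up for illustration.

```python
import pandas as pd
from scipy.sparse import csr_matrix

w = ['by', 'step', 'by', 'the', 'is', 'step', 'is', 'by', 'is']
c = ['step', 'what', 'is', 'what', 'the', 'the', 'step', 'the', 'what']
ppmi = [1, 3, 12, 3, 123, 1, 321, 1, 23]
df = pd.DataFrame({'w': w, 'c': c, 'ppmi': ppmi})

# Map every distinct string to an integer code.
# sort=True orders the labels alphabetically, so code i refers to labels[i].
row_codes, row_labels = pd.factorize(df['w'], sort=True)
col_codes, col_labels = pd.factorize(df['c'], sort=True)

# Build the CSR matrix from (data, (row indices, column indices)).
mat = csr_matrix(
    (df['ppmi'].to_numpy(), (row_codes, col_codes)),
    shape=(len(row_labels), len(col_labels)),
)

dense = mat.toarray()
print(list(row_labels))  # words for the rows
print(list(col_labels))  # words for the columns
print(dense)
```

Here row_labels[i] and col_labels[j] give back the words for position (i, j), which the MultiIndex version keeps in df.index.levels instead.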