pointers in sparse matrix in python scipy


I am trying to understand sparse matrices in scipy, especially the csr_matrix format.

Suppose I have the following texts:

 docs = ['hello  world hello', 'goodbye cruel world']

I tokenize them and get a list of dictionaries with token occurrences (one per document) and a dictionary mapping token ids to tokens.

ids_token = {0: 'world', 1: 'hello', 2: 'cruel', 3: 'goodbye'}
token_counts = [{0: 1, 1: 2}, {0: 1, 2: 1, 3: 1}]

How can I transform token_counts into a csr_matrix?

Here is what I tried so far:

from scipy.sparse import csr_matrix

data = [item for sublist in token_counts for item in sublist.values()]
print('data:', data)

indices = [item for sublist in token_counts for item in sublist.keys()]
print('indices:', indices)

indptr = [0] + [len(item) for item in token_counts]
print('pointers:', indptr)

# now I create the matrix
sp_matrix = csr_matrix((data, indices, indptr), dtype=int)
print(sp_matrix.toarray())

import pandas as pd
pd.DataFrame(sp_matrix.toarray().transpose(), index=list(ids_token.values()))

The result is not what I expect: there are zeros in the last rows.

I suspect that the problem is in the pointer array indptr. What am I missing?

Any help is appreciated.

Update: this is what I would like to get:

         doc0  doc1
cruel       0     1
goodbye     0     1
hello       2     0
world       1     1

P.S.: the example is taken from the scipy documentation.


Solution

  • It would help if you gave a sample of the matrix you are trying to produce.

    Generally we don't try to specify the csr values directly. The indptr array in particular is a bit obscure. The coo style of input is generally better: (data_array, (i_array, j_array)), where M[i, j] = data. sparse automatically converts that to the csr format.

    The dok format is also convenient. There the matrix is stored as a dictionary, with the tuple (i, j) as the key.

    In [151]: data = [item for sublist in token_counts for item in sublist.values()] 
    In [152]: rows = [item for sublist in token_counts for item in sublist.keys()]
    In [153]: cols = [i for i,sublist in enumerate(token_counts) for item in sublist.keys()]
    In [155]: M=sparse.csr_matrix((data,(rows,cols)))
    In [156]: M
    Out[156]: 
    <4x2 sparse matrix of type '<class 'numpy.int32'>'
        with 5 stored elements in Compressed Sparse Row format>
    In [157]: M.A
    Out[157]: 
    array([[1, 1],
           [2, 0],
           [0, 1],
           [0, 1]], dtype=int32)
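
The dok route mentioned earlier can be sketched roughly as follows: fill the matrix entry by entry, then convert to csr. The names n_tokens, D, M_dok and dense_dok are illustrative only, and the vocabulary size of 4 is assumed from ids_token in the question.

```python
from scipy import sparse

token_counts = [{0: 1, 1: 2}, {0: 1, 2: 1, 3: 1}]
n_tokens = 4  # assumption: vocabulary size, i.e. len(ids_token) in the question

# Rows are token ids and columns are documents, matching the 4x2 matrix above.
D = sparse.dok_matrix((n_tokens, len(token_counts)), dtype=int)
for doc_id, counts in enumerate(token_counts):
    for token_id, count in counts.items():
        D[token_id, doc_id] = count  # each (i, j) tuple is a dictionary key

M_dok = D.tocsr()             # convert once the matrix has been filled
dense_dok = M_dok.toarray()
print(dense_dok)
```

Passing the shape explicitly avoids relying on the largest index seen to infer the matrix size.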
    

    Look at the attributes of M to see how you could construct it with the indptr format:

    In [158]: M.data
    Out[158]: array([1, 1, 2, 1, 1], dtype=int32)
    In [159]: M.indices
    Out[159]: array([0, 1, 0, 1, 1], dtype=int32)
    In [160]: M.indptr
    Out[160]: array([0, 2, 3, 4, 5], dtype=int32)
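
These attributes show what went wrong in the question: indptr must hold cumulative offsets, so that row i occupies data[indptr[i]:indptr[i+1]]. The attempt in the question produced [0, 2, 3] (the row lengths) where [0, 2, 5] was needed, so the second row was cut short. A minimal sketch of the direct construction, with documents as rows as in the question; the names docs_by_tokens and tokens_by_docs and the vocabulary size of 4 are assumptions for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix

token_counts = [{0: 1, 1: 2}, {0: 1, 2: 1, 3: 1}]

data = [count for counts in token_counts for count in counts.values()]
indices = [token_id for counts in token_counts for token_id in counts.keys()]

# Cumulative sum of the row lengths, with a leading 0.
row_lengths = [len(counts) for counts in token_counts]
indptr = np.concatenate(([0], np.cumsum(row_lengths)))

# One row per document, one column per token id (4 tokens assumed).
docs_by_tokens = csr_matrix((data, indices, indptr),
                            shape=(len(token_counts), 4), dtype=int)

# Transpose to get tokens as rows, like M above.
tokens_by_docs = docs_by_tokens.T.tocsr()
print(indptr)
print(tokens_by_docs.toarray())
```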
    

    The str display of a sparse matrix enumerates the nonzero elements (a dok format would look like this internally).

    In [161]: print(M)
      (0, 0)    1
      (0, 1)    1
      (1, 0)    2
      (2, 1)    1
      (3, 1)    1
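
To get from this matrix to the labelled table requested in the question, one possible sketch is to wrap the dense array in a pandas DataFrame and sort by token. The names labels, columns and table are illustrative, and the doc0/doc1 column names are assumed from the question's desired output.

```python
import pandas as pd
from scipy import sparse

ids_token = {0: 'world', 1: 'hello', 2: 'cruel', 3: 'goodbye'}
token_counts = [{0: 1, 1: 2}, {0: 1, 2: 1, 3: 1}]

# coo-style input: rows are token ids, columns are document numbers.
data = [c for counts in token_counts for c in counts.values()]
rows = [t for counts in token_counts for t in counts.keys()]
cols = [j for j, counts in enumerate(token_counts) for _ in counts]
M = sparse.csr_matrix((data, (rows, cols)),
                      shape=(len(ids_token), len(token_counts)))

# Label rows by token (in id order) and columns by document, then sort by token.
labels = [ids_token[i] for i in range(len(ids_token))]
columns = ['doc%d' % j for j in range(len(token_counts))]
table = pd.DataFrame(M.toarray(), index=labels, columns=columns).sort_index()
print(table)
```

Building the index from ids_token in id order keeps the labels aligned with the matrix rows regardless of how the dictionary was populated.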