python-3.x dictionary scipy sparse-matrix

Access element from csr_matrix


I have created a sparse matrix using SciPy's dok_matrix as follows:

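from collections import Counter

import numpy as np
from scipy.sparse import dok_matrix

# One row per document, one column per vocabulary word
sparse_dtm = dok_matrix((num_documents, vocabulary_size), dtype=np.float32)
for doc_index, document in enumerate(data_list):
    document_counter = Counter(document)
    for word in set(document):
        sparse_dtm[doc_index, word_index[word]] = document_counter[word]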

Where data_list is a list of lists with tokenized texts.

After having created sparse_dtm, I would like to retrieve all values for the first row.

From the documentation I know that I can use the .getrow() method to get all elements from row i.

However, so far I am unable to retrieve the keys/values stored in the csr_matrix that .getrow() returns:

sparse_dtm.getrow(0).keys()
AttributeError: keys not found

sparse_dtm.getrow(0)[0]
<1x90140 sparse matrix of type '<class 'numpy.float32'>'
    with 576 stored elements in Compressed Sparse Row format>

sparse_dtm does contain the right information though:

print(sparse_dtm.getrow(0))
Output: (0, 21018)    6.0
        (0, 76741)    3.0
        (0, 14008)    1.0
        (0, 54143)    2.0
        (0, 11866)    1.0
        ...
  

How can I access the elements of row i and retrieve their keys and values?


Solution

  • To obtain the values:

    # getrow(0) returns a 1 x vocabulary_size CSR matrix; convert it to a dense 1-D array
    row_zero = sparse_dtm.getrow(0).toarray()[0]
    

    This gives the whole row as a dense array, zeros included. To obtain the keys (column indices) of the stored values, take the indices of the non-zero entries:

    indices = row_zero.nonzero()[0]
    

    Then look up each of these indices in index_to_word to get the corresponding word. I created that mapping as follows:

    word_to_index = dict()
    index_to_word = dict()
    
    for i, word in enumerate(vocabulary):
        word_to_index[word] = i
        index_to_word[i] = word
    

    Where vocabulary is a set of all the words in the corpus.
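
As a possible alternative to converting the row to a dense array, here is a minimal sketch that reads the column indices and values straight from the CSR row through its .indices and .data attributes. The toy data_list, the sorted vocabulary, and the name row_counts are made up for illustration, and the matrix is converted once with .tocsr() instead of calling .getrow(), which is an assumption about what suits repeated row lookups:

```python
from collections import Counter

import numpy as np
from scipy.sparse import dok_matrix

# Toy corpus standing in for data_list (hypothetical data).
data_list = [["apple", "pear", "apple"], ["pear", "plum"]]

# Sorted so that the word -> column mapping is reproducible between runs.
vocabulary = sorted({word for doc in data_list for word in doc})
word_to_index = {word: i for i, word in enumerate(vocabulary)}
index_to_word = {i: word for word, i in word_to_index.items()}

# Build the document-term matrix the same way as in the question.
sparse_dtm = dok_matrix((len(data_list), len(vocabulary)), dtype=np.float32)
for doc_index, document in enumerate(data_list):
    for word, count in Counter(document).items():
        sparse_dtm[doc_index, word_to_index[word]] = count

# Convert once to CSR; indexing a single row gives a 1 x n CSR matrix.
csr_dtm = sparse_dtm.tocsr()
row = csr_dtm[0]

# For a single CSR row, .indices holds the column indices of the stored
# entries and .data holds the matching values, in the same order.
row_counts = {
    index_to_word[int(col)]: float(value)
    for col, value in zip(row.indices, row.data)
}
print(row_counts)
```

This only touches the stored entries, so it should scale with the number of non-zero values in the row (576 in the question) and not with the vocabulary size (90140).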