I have created a sparse matrix using SciPy's dok_matrix class as follows:
sparse_dtm = dok_matrix((num_documents, vocabulary_size), dtype=np.float32)
for doc_index, document in enumerate(data_list):
    document_counter = Counter(document)
    for word in set(document):
        sparse_dtm[doc_index, word_index[word]] = document_counter[word]
Here, data_list is a list of lists containing the tokenized texts.
After creating sparse_dtm, I would like to retrieve all values for the first row.
From the documentation I know that I can use the .getrow() method to get all elements from row i.
However, so far I have been unable to retrieve the keys/values stored in the csr_matrix that it returns:
sparse_dtm.getrow(0).keys()
AttributeError: keys not found
sparse_dtm.getrow(0)[0]
<1x90140 sparse matrix of type '<class 'numpy.float32'>'
with 576 stored elements in Compressed Sparse Row format>
sparse_dtm does contain the right information though:
print(sparse_dtm.getrow(0))
Output:
(0, 21018) 6.0
(0, 76741) 3.0
(0, 14008) 1.0
(0, 54143) 2.0
(0, 11866) 1.0
...
How can I access the elements of row i and retrieve their keys and values?
To obtain the values:
sparse_p_ij = dok_matrix((num_documents, vocabulary_size), dtype=np.float32)
row_zero = sparse_dtm.getrow(0).toarray()[0]
This gives a dense array holding every value in the row, zeros included. To obtain the key (column index) of each stored value, take the indices of the non-zero entries:
indices = row_zero.nonzero()[0]
Then look up each of these indices in index_to_word, which I created as follows:
word_to_index = dict()
index_to_word = dict()
for i, word in enumerate(vocabulary):
    word_to_index[word] = i
    index_to_word[i] = word
Here, vocabulary is a set of all the words in the corpus.
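As a possible alternative that avoids building a dense row, the CSR row's indices and data attributes can be read directly; they hold the column indices and the stored values of that row. The following is a minimal sketch, not the original setup: the toy data_list is made up for illustration, and the vocabulary is sorted here only so that the word-to-index mapping is reproducible.

```python
from collections import Counter

import numpy as np
from scipy.sparse import dok_matrix

# Toy corpus (hypothetical data, for illustration only).
data_list = [["apple", "banana", "apple"], ["banana", "cherry"]]

# Sorting is an assumption of this sketch; it makes the mapping deterministic.
vocabulary = sorted({word for document in data_list for word in document})
word_to_index = {word: i for i, word in enumerate(vocabulary)}
index_to_word = {i: word for word, i in word_to_index.items()}

# Build the document-term matrix in DOK format, one count per (doc, word).
sparse_dtm = dok_matrix((len(data_list), len(vocabulary)), dtype=np.float32)
for doc_index, document in enumerate(data_list):
    for word, count in Counter(document).items():
        sparse_dtm[doc_index, word_to_index[word]] = count

# Convert to CSR once, then take row 0 as a 1 x vocabulary_size CSR matrix.
row = sparse_dtm.tocsr().getrow(0)

# row.indices are the column indices (keys); row.data are the stored values.
row_counts = {
    index_to_word[int(col)]: float(value)
    for col, value in zip(row.indices, row.data)
}
print(row_counts)
```

For this toy corpus, row_counts maps "apple" to 2.0 and "banana" to 1.0. With a vocabulary of roughly 90,000 words and only a few hundred stored entries per row, reading indices and data touches just the stored entries, whereas toarray() allocates the full row first.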