I have created a sparse representation of data and want to convert this into a Numpy array.
Let's say I have the following data (in practice, data contains many more lists and each list is much longer):
data = [['this','is','my','first','dataset','here'],['but','here', 'is', 'another','one'],['and','yet', 'another', 'one']]
And I have two dicts that map each word to a unique integer value and vice versa:
w2i = {'this':0, 'is':1, 'my':2, 'first':3, 'dataset':4, 'here':5, 'but':6, 'another':7, 'one':8, 'and':9, 'yet':10}
Furthermore, I have a dict
that gets the count for each word combination:
comb_dict = dict()
for text in data:
    sorted_set_text = sorted(set(text))
    for i in range(len(sorted_set_text) - 1):
        for j in range(i + 1, len(sorted_set_text)):
            pair = (sorted_set_text[i], sorted_set_text[j])
            if pair in comb_dict:
                comb_dict[pair] += 1
            else:
                comb_dict[pair] = 1
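The nested loops can also be written more compactly with itertools.combinations and collections.Counter; this is a sketch that produces the same pair counts:

```python
from collections import Counter
from itertools import combinations

data = [['this', 'is', 'my', 'first', 'dataset', 'here'],
        ['but', 'here', 'is', 'another', 'one'],
        ['and', 'yet', 'another', 'one']]

# Count each unordered pair of distinct words, once per document.
comb_dict = Counter()
for text in data:
    comb_dict.update(combinations(sorted(set(text)), 2))
```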
From this dict, I create a sparse representation as follows:
sparse = [(w2i[k[0]],w2i[k[1]],v) for k,v in comb_dict.items()]
This list consists of tuples in which the first value is the row index, the second the column index, and the third the number of co-occurrences:
[(4, 3, 1),
(4, 5, 1),
(4, 1, 1),
(4, 2, 1),
(4, 0, 1),
(3, 5, 1),
(3, 1, 1),
(3, 2, 1),
(3, 0, 1),
(5, 1, 2),
(5, 2, 1),
(5, 0, 1),
(1, 2, 1),
(1, 0, 1),
(2, 0, 1),
(7, 6, 1),
(7, 5, 1),
(7, 1, 1),
(7, 8, 2),
(6, 5, 1),
(6, 1, 1),
(6, 8, 1),
(5, 8, 1),
(1, 8, 1),
(9, 7, 1),
(9, 8, 1),
(9, 10, 1),
(7, 10, 1),
(8, 10, 1)]
Now, I want to get an 11 x 11 Numpy array in which each row i and column j represent a word and each cell indicates how often words i and j co-occur. Thus, a start would be:
cooc = np.zeros((len(w2i),len(w2i)), dtype=np.int16)
Then, I want to update cooc so that the row/column indices associated with the word combinations in sparse are assigned the associated value. How can I do this?
EDIT: I am aware that I can loop through sparse and assign each cell one by one. However, my dataset is large and this would be time-intensive. Instead, I would like to convert sparse into a Scipy sparse matrix and use its toarray() method. How can I do this?
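For the route the EDIT asks about, scipy.sparse.coo_matrix accepts exactly the (value, (row, col)) triples already present in sparse. A minimal sketch, assuming SciPy is available (the triples below are a truncated sample of the full list):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Truncated sample of the (row, col, count) triples in `sparse`.
sparse = [(4, 3, 1), (4, 5, 1), (5, 1, 2), (7, 8, 2)]

rows, cols, vals = zip(*sparse)
n = 11  # len(w2i)
cooc = coo_matrix((vals, (rows, cols)), shape=(n, n), dtype=np.int16)

dense = cooc.toarray()       # one triangle only, as stored
symmetric = dense + dense.T  # each pair was stored once, so mirroring is safe
```

Duplicate (row, col) entries, if any, are summed by coo_matrix, which is exactly the accumulation behavior wanted here.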
I think these other answers are kinda reinventing a wheel that already exists.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
data = [['this','is','my','first','dataset','here'],['but','here', 'is', 'another','one'],['and','yet', 'another', 'one']]
I'm going to put these back together and just use sklearn's CountVectorizer
data = [" ".join(x) for x in data]
encoder = CountVectorizer()
occurrence = encoder.fit_transform(data)
This occurrence matrix is a sparse matrix, and turning it into a co-occurrence matrix is just a simple multiplication (since each token appears at most once per document here, the diagonal is the number of documents each token appears in).
co_occurrence = occurrence.T @ occurrence
>>> co_occurrence.toarray()
array([[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
[1, 2, 1, 0, 0, 1, 1, 0, 2, 0, 1],
[0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0],
[0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
[0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
[0, 1, 1, 1, 1, 2, 2, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 2, 2, 1, 1, 1, 0],
[0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
[1, 2, 1, 0, 0, 1, 1, 0, 2, 0, 1],
[0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1]])
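If the self-counts on the diagonal are unwanted (the question's sparse triples exclude them), they can be zeroed before densifying. A sketch using a small stand-in occurrence matrix rather than the CountVectorizer output:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Stand-in occurrence matrix: 3 documents x 4 tokens, 0/1 counts.
occurrence = csr_matrix(np.array([[1, 0, 1, 0],
                                  [0, 1, 1, 1],
                                  [1, 1, 0, 0]]))

co_occurrence = (occurrence.T @ occurrence).tolil()  # LIL allows cheap diagonal edits
co_occurrence.setdiag(0)                             # drop each token's self-count
dense = co_occurrence.toarray()
```

The tolil() conversion avoids the SparseEfficiencyWarning that setdiag raises on CSR matrices.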
And the row/column labels can be recovered from the encoder:
encoder.vocabulary_
{'this': 9,
'is': 6,
'my': 7,
'first': 4,
'dataset': 3,
'here': 5,
'but': 2,
'another': 1,
'one': 8,
'and': 0,
'yet': 10}