I have created a sparse representation of data and want to convert this into a Numpy array.
Let's say I have the following data (in practice, data contains many more lists and each list is much longer):
data = [['this','is','my','first','dataset','here'],['but','here', 'is', 'another','one'],['and','yet', 'another', 'one']]
And I have two dicts that map each word to a unique integer value and vice versa:
w2i = {'this':0, 'is':1, 'my':2, 'first':3, 'dataset':4, 'here':5, 'but':6, 'another':7, 'one':8, 'and':9, 'yet':10}
Furthermore, I have a dict
that gets the count for each word combination:
comb_dict = dict()
for text in data:
    sorted_set_text = sorted(set(text))
    for i in range(len(sorted_set_text) - 1):
        for j in range(i + 1, len(sorted_set_text)):
            pair = (sorted_set_text[i], sorted_set_text[j])
            if pair in comb_dict:
                comb_dict[pair] += 1
            else:
                comb_dict[pair] = 1
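The nested loops can also be written more compactly with itertools.combinations and collections.Counter; this is a sketch that produces the same pair counts:

```python
from collections import Counter
from itertools import combinations

data = [['this', 'is', 'my', 'first', 'dataset', 'here'],
        ['but', 'here', 'is', 'another', 'one'],
        ['and', 'yet', 'another', 'one']]

# Count each unordered pair of distinct words, once per document.
comb_dict = Counter()
for text in data:
    comb_dict.update(combinations(sorted(set(text)), 2))
```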
From this dict, I create a sparse representation as follows:
sparse = [(w2i[k[0]],w2i[k[1]],v) for k,v in comb_dict.items()]
This list consists of tuples in which the first value is the row index, the second the column index, and the third the number of co-occurrences:
[(4, 3, 1),
(4, 5, 1),
(4, 1, 1),
(4, 2, 1),
(4, 0, 1),
(3, 5, 1),
(3, 1, 1),
(3, 2, 1),
(3, 0, 1),
(5, 1, 2),
(5, 2, 1),
(5, 0, 1),
(1, 2, 1),
(1, 0, 1),
(2, 0, 1),
(7, 6, 1),
(7, 5, 1),
(7, 1, 1),
(7, 8, 2),
(6, 5, 1),
(6, 1, 1),
(6, 8, 1),
(5, 8, 1),
(1, 8, 1),
(9, 7, 1),
(9, 8, 1),
(9, 10, 1),
(7, 10, 1),
(8, 10, 1)]
Now, I want to get an 11 x 11 Numpy array in which each row i and column j represent a word and each cell indicates how often words i and j co-occur. Thus, a start would be:
cooc = np.zeros((len(w2i),len(w2i)), dtype=np.int16)
Then, I want to update cooc so that the row/column indices associated with the word combinations in sparse are assigned the associated value. How can I do this?
EDIT: I am aware that I can loop through sparse and assign each cell one by one. However, my dataset is large and this would be time-intensive. Instead, I would like to convert sparse into a Scipy sparse matrix and use its toarray() method. How can I do this?
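For the route the EDIT asks about, scipy.sparse.coo_matrix accepts exactly the (value, (row, col)) triples already present in sparse. A minimal sketch, assuming SciPy is available (the triples below are a truncated sample of the full list):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Truncated sample of the (row, col, count) triples in `sparse`.
sparse = [(4, 3, 1), (4, 5, 1), (5, 1, 2), (7, 8, 2)]

rows, cols, vals = zip(*sparse)
n = 11  # len(w2i)
cooc = coo_matrix((vals, (rows, cols)), shape=(n, n), dtype=np.int16)

dense = cooc.toarray()       # one triangle only, as stored
symmetric = dense + dense.T  # each pair was stored once, so mirroring is safe
```

Duplicate (row, col) entries, if any, are summed by coo_matrix, which is exactly the accumulation behavior wanted here.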
I think these other answers are kinda reinventing a wheel that already exists.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
data = [['this','is','my','first','dataset','here'],['but','here', 'is', 'another','one'],['and','yet', 'another', 'one']]
I'm going to put these back together and just use sklearn's CountVectorizer
data = [" ".join(x) for x in data]
encoder = CountVectorizer()
occurrence = encoder.fit_transform(data)
This occurrence matrix is a sparse matrix, and turning it into a co-occurrence matrix is just a simple multiplication (since each token appears at most once per document here, the diagonal is the number of documents each token appears in).
co_occurrence = occurrence.T @ occurrence
>>> co_occurrence.toarray()
array([[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
[1, 2, 1, 0, 0, 1, 1, 0, 2, 0, 1],
[0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0],
[0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
[0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
[0, 1, 1, 1, 1, 2, 2, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 2, 2, 1, 1, 1, 0],
[0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
[1, 2, 1, 0, 0, 1, 1, 0, 2, 0, 1],
[0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1]])
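If the self-counts on the diagonal are unwanted (the question's sparse triples exclude them), they can be zeroed before densifying. A sketch using a small stand-in occurrence matrix rather than the CountVectorizer output:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Stand-in occurrence matrix: 3 documents x 4 tokens, 0/1 counts.
occurrence = csr_matrix(np.array([[1, 0, 1, 0],
                                  [0, 1, 1, 1],
                                  [1, 1, 0, 0]]))

co_occurrence = (occurrence.T @ occurrence).tolil()  # LIL allows cheap diagonal edits
co_occurrence.setdiag(0)                             # drop each token's self-count
dense = co_occurrence.toarray()
```

The tolil() conversion avoids the SparseEfficiencyWarning that setdiag raises on CSR matrices.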
And the row/column labels can be recovered from the encoder:
encoder.vocabulary_
{'this': 9,
'is': 6,
'my': 7,
'first': 4,
'dataset': 3,
'here': 5,
'but': 2,
'another': 1,
'one': 8,
'and': 0,
'yet': 10}