Tags: python, arrays, numpy, sparse-matrix

Create Numpy array from sparse representation


I have created a sparse representation of my data and want to convert it into a NumPy array.

Say I have the following data (in practice, data contains many more lists, and each list is much longer):

data = [['this','is','my','first','dataset','here'],['but','here', 'is', 'another','one'],['and','yet', 'another', 'one']]

I also have two dicts that map each word to a unique integer and vice versa (only the word-to-index dict, w2i, is shown):

w2i = {'this':0, 'is':1, 'my':2, 'first':3, 'dataset':4, 'here':5, 'but':6, 'another':7, 'one':8, 'and':9, 'yet':10}
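The reverse mapping is not shown; a minimal sketch of how it could be derived from w2i, assuming every integer value is unique (the name i2w is hypothetical):

```python
w2i = {'this': 0, 'is': 1, 'my': 2, 'first': 3, 'dataset': 4, 'here': 5,
       'but': 6, 'another': 7, 'one': 8, 'and': 9, 'yet': 10}

# Invert the mapping: index -> word. This only round-trips cleanly
# if no two words share the same integer.
i2w = {i: w for w, i in w2i.items()}
```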

Furthermore, I have a dict that stores the count for each word pair, i.e. the number of lists in which both words occur:

comb_dict = dict()
for text in data:
    sorted_set_text = sorted(list(set(text)))
    for i in range(len(sorted_set_text)-1):
        for j in range(i+1, len(sorted_set_text)):
            if (sorted_set_text[i],sorted_set_text[j]) in comb_dict:
                comb_dict[(sorted_set_text[i],sorted_set_text[j])] += 1
            else:
                comb_dict[(sorted_set_text[i],sorted_set_text[j])] = 1
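The same counting can probably be written more compactly with collections.Counter and itertools.combinations. This is only a sketch of an equivalent approach; the name comb_counts is hypothetical and stands in for comb_dict:

```python
from collections import Counter
from itertools import combinations

data = [['this', 'is', 'my', 'first', 'dataset', 'here'],
        ['but', 'here', 'is', 'another', 'one'],
        ['and', 'yet', 'another', 'one']]

comb_counts = Counter()
for text in data:
    # Deduplicate and sort so every pair has one canonical (alphabetical) order.
    words = sorted(set(text))
    # combinations() yields each unordered pair of distinct words exactly once.
    comb_counts.update(combinations(words, 2))
```

A Counter is a dict subclass, so comb_counts.items() can be used wherever comb_dict.items() is used.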

From this dict, I create a sparse representation as follows:

sparse = [(w2i[k[0]],w2i[k[1]],v) for k,v in comb_dict.items()]

This list consists of tuples in which the first value is the position on the x-axis, the second value is the position on the y-axis, and the third value is the number of co-occurrences:

[(4, 3, 1),
 (4, 5, 1),
 (4, 1, 1),
 (4, 2, 1),
 (4, 0, 1),
 (3, 5, 1),
 (3, 1, 1),
 (3, 2, 1),
 (3, 0, 1),
 (5, 1, 2),
 (5, 2, 1),
 (5, 0, 1),
 (1, 2, 1),
 (1, 0, 1),
 (2, 0, 1),
 (7, 6, 1),
 (7, 5, 1),
 (7, 1, 1),
 (7, 8, 2),
 (6, 5, 1),
 (6, 1, 1),
 (6, 8, 1),
 (5, 8, 1),
 (1, 8, 1),
 (9, 7, 1),
 (9, 8, 1),
 (9, 10, 1),
 (7, 10, 1),
 (8, 10, 1)]

Now I want to get a NumPy array (11 x 11) in which each row i and column j represents a word and each cell indicates how often words i and j co-occur. A starting point would be:

cooc = np.zeros((len(w2i),len(w2i)), dtype=np.int16)

Then I want to update cooc so that each cell addressed by a (row, column) pair in sparse is assigned the corresponding count. How can I do this?

EDIT: I am aware that I can loop through sparse and assign each cell one by one. However, my dataset is large and this will be time-intensive. Instead, I would like to build cooc as a SciPy sparse matrix from sparse and then use its toarray() method. How can I do this?


Solution

  • I think these other answers are kinda reinventing a wheel that already exists.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer 
    
    data = [['this','is','my','first','dataset','here'],['but','here', 'is', 'another','one'],['and','yet', 'another', 'one']]
    

    I'm going to join each token list back into a single string and just use sklearn's CountVectorizer:

    data = [" ".join(x) for x in data]
    encoder = CountVectorizer()
    occurrence = encoder.fit_transform(data)
    

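    This occurrence matrix is a sparse matrix, and turning it into a co-occurrence matrix is a single multiplication. Here every token appears at most once per document, so the diagonal is the number of documents containing each token; with repeated tokens it would be the sum of squared counts instead.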

    co_occurrence = occurrence.T @ occurrence
    
    >>> co_occurrence.toarray()
    
    array([[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
           [1, 2, 1, 0, 0, 1, 1, 0, 2, 0, 1],
           [0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0],
           [0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
           [0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
           [0, 1, 1, 1, 1, 2, 2, 1, 1, 1, 0],
           [0, 1, 1, 1, 1, 2, 2, 1, 1, 1, 0],
           [0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
           [1, 2, 1, 0, 0, 1, 1, 0, 2, 0, 1],
           [0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
           [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1]])
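The question's matrix has no self-counts, so the diagonal may need to be cleared. A minimal sketch of one way to do that, using a small hand-made document-term matrix (illustrative values only) in place of the CountVectorizer output:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3-document x 3-term occurrence matrix, standing in for
# the matrix returned by CountVectorizer.fit_transform().
occurrence = csr_matrix(np.array([[1, 1, 0],
                                  [0, 1, 1],
                                  [0, 0, 1]]))

# Term-by-term co-occurrence counts; the diagonal holds self-counts.
# LIL format is used because it allows cheap in-place edits.
co_occurrence = (occurrence.T @ occurrence).tolil()

# Clear the diagonal so only pairs of distinct terms remain.
co_occurrence.setdiag(0)

dense = co_occurrence.toarray()
```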
    

    And the row/column labels can be recovered from the encoder:

    encoder.vocabulary_
    
    {'this': 9,
     'is': 6,
     'my': 7,
     'first': 4,
     'dataset': 3,
     'here': 5,
     'but': 2,
     'another': 1,
     'one': 8,
     'and': 0,
     'yet': 10}
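If the sparse list of (row, column, count) triples from the question already exists, it can likely be passed to scipy.sparse.coo_matrix directly, without looping over cells. The sketch below uses only a few of the question's triples; the symmetrization step is an assumption about what the final matrix should look like, not something the question states.

```python
import numpy as np
from scipy.sparse import coo_matrix

# A few (row, column, count) triples in the same shape as the question's `sparse` list.
sparse = [(4, 3, 1), (5, 1, 2), (7, 8, 2), (9, 10, 1)]
n = 11  # len(w2i) in the question

# Split the triples into three parallel sequences.
rows, cols, vals = zip(*sparse)

# COO format takes (values, (row_indices, col_indices)) directly.
cooc = coo_matrix((vals, (rows, cols)), shape=(n, n), dtype=np.int16).toarray()

# Assumption: a symmetric matrix is wanted. Each pair is stored once and
# there are no diagonal entries, so adding the transpose mirrors the counts.
cooc_sym = cooc + cooc.T
```

Note that coo_matrix sums duplicate (row, column) entries when converting, which is harmless here because each pair appears once.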