Tags: python, scipy, sparse-matrix

How to convert co-occurrence matrix to sparse matrix


I am just starting to work with sparse matrices, so I'm not really proficient on this topic. My problem is this: I have a simple co-occurrence matrix built from a word list, just a 2-dimensional word-by-word matrix counting how many times a word occurs in the same context as another. The matrix is quite sparse since the corpus is not that big. I want to convert it to a sparse matrix so that I can handle it better and eventually do some matrix multiplication. Here is what I have done so far (only the first part; the rest is just output formatting and data cleaning):

from collections import defaultdict

def matrix(corpus):
    # d[head][tran] counts how often the pair (head, tran) occurs
    d = defaultdict(lambda: defaultdict(int))
    heads = set()
    trans = set()
    for text in corpus:
        d[text[0]][text[1]] += 1
        heads.add(text[0])
        trans.add(text[1])

    return d, heads, trans

My idea would be to make a new function:

def matrix_to_sparse(d):
    A = sparse.lil_matrix(d)

Does this make any sense? However, this is not working, and I don't know how to get a sparse matrix. Would I be better off working with numpy arrays? What would be the best way to do this? I want to compare several ways of dealing with matrices.

It would be nice if someone could point me in the right direction.


Solution

  • Here's how you construct a document-term matrix A from a set of documents in SciPy's COO format, which is a good tradeoff between ease of use and efficiency(*):

    import scipy.sparse
    
    vocabulary = {}  # map terms to column indices
    data = []        # values (maybe weights)
    row = []         # row (document) indices
    col = []         # column (term) indices
    
    for i, doc in enumerate(documents):
        for term in doc:
            # get column index, adding the term to the vocabulary if needed
            j = vocabulary.setdefault(term, len(vocabulary))
            data.append(1)  # uniform weights
            row.append(i)
            col.append(j)
    
    A = scipy.sparse.coo_matrix((data, (row, col)))
    

    Now, to get a cooccurrence matrix:

    A.T * A
    

    (Ignore the diagonal, which holds cooccurrences of terms with themselves, i.e. sums of squared per-document frequencies.)
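To make the diagonal remark concrete, here is a minimal sketch on a toy corpus. The `documents` list is made-up illustration data, and clearing the diagonal with `setdiag(0)` followed by `eliminate_zeros()` is just one possible way to drop the self-cooccurrences, not necessarily what the answer had in mind.

```python
import scipy.sparse

# Toy corpus (hypothetical data): each document is a list of terms.
documents = [["a", "b"], ["a", "c"], ["a", "b", "c"]]

vocabulary = {}  # map terms to column indices
data, row, col = [], [], []
for i, doc in enumerate(documents):
    for term in doc:
        j = vocabulary.setdefault(term, len(vocabulary))
        data.append(1)
        row.append(i)
        col.append(j)

A = scipy.sparse.coo_matrix((data, (row, col)))

# Term-by-term cooccurrence counts. `@` is the matrix product
# (the same operation as `*` for coo_matrix); CSR suits later arithmetic.
C = (A.T @ A).tocsr()

# Zero the diagonal (self-cooccurrences), then drop the stored zeros.
C.setdiag(0)
C.eliminate_zeros()

print(vocabulary)
print(C.toarray())
```

Row and column `k` of `C` both refer to the term whose index in `vocabulary` is `k`, so the matrix is symmetric.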

    Alternatively, use some package that does this kind of thing for you, such as Gensim or scikit-learn. (I'm a contributor to both projects, so this might not be unbiased advice.)
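If you already have the nested count dictionary from the question, the same `(data, (row, col))` construction should work on it directly. The following is only a sketch: `dict_to_sparse`, `row_index` and `col_index` are hypothetical names introduced here, and the small `corpus` of pairs is invented for illustration.

```python
from collections import defaultdict

import scipy.sparse

# Hypothetical corpus of (head, tran) pairs, counted as in the question.
corpus = [("cat", "sat"), ("cat", "ran"), ("dog", "ran"), ("cat", "sat")]
d = defaultdict(lambda: defaultdict(int))
for head, tran in corpus:
    d[head][tran] += 1


def dict_to_sparse(d):
    """Convert nested counts d[head][tran] into a COO matrix plus index maps."""
    row_index = {}  # head -> row index
    col_index = {}  # tran -> column index
    data, row, col = [], [], []
    for head, counts in d.items():
        i = row_index.setdefault(head, len(row_index))
        for tran, count in counts.items():
            j = col_index.setdefault(tran, len(col_index))
            data.append(count)
            row.append(i)
            col.append(j)
    # Pass the shape explicitly so it matches the two index maps.
    shape = (len(row_index), len(col_index))
    M = scipy.sparse.coo_matrix((data, (row, col)), shape=shape)
    return M, row_index, col_index


M, row_index, col_index = dict_to_sparse(d)
print(M.toarray())
```

COO is convenient for building the matrix; calling `M.tocsr()` afterwards is the usual step before doing matrix multiplication.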