Tags: python, scikit-learn, nlp, countvectorizer

How to apply weights to sentences in CountVectorizer (count each sentence's tokens several times)


I'm using CountVectorizer to create a sparse matrix representation of a co-occurrence matrix.

I have a list of sentences, and I have another list (vector) of "weights": the number of times I'd like each sentence's tokens to be counted.

It's possible to create a list with each sentence repeated many times according to its relevant weight, but this is terribly inefficient and un-pythonic. Some of my weights are in the millions and up.

How can I efficiently tell CountVectorizer to use the weight vector I have?


Solution

  • As far as I could find, CountVectorizer has no way to apply a weight to each sentence it is given. Instead, you can vectorize the sentences once and then scale each row of the resulting sparse matrix by that sentence's weight.

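    import scipy.sparse
    from sklearn.feature_extraction.text import CountVectorizer

    # space_splitter is a user-defined tokenizer (not shown); df.weight holds one weight per sentence.
    cv = CountVectorizer(lowercase=False, min_df=0.001, tokenizer=space_splitter)
    X = cv.fit_transform(all_strings)

    # Put the weights on the diagonal of a sparse matrix; counts @ X scales row i by weight i.
    counts = scipy.sparse.diags(df.weight, 0)
    Xw = counts @ X
    # Co-occurrence matrix. The weights are applied on one side only: using the
    # weighted matrix on both sides would count each sentence weight**2 times.
    Xc = X.T @ Xw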
    

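    Note that the weights must sit on the diagonal of a sparse matrix (here built with scipy.sparse.diags). A dense diagonal matrix would need one entry for every pair of sentences, which quickly becomes too large to hold in memory.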
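
    As a sanity check, here is a minimal sketch on toy data comparing the diagonal-weight approach with literally repeating each sentence. The sentences, the weights, and all variable names are invented for illustration, and it uses CountVectorizer's default tokenizer rather than the custom one above.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, invented for illustration only.
sentences = ["cat dog", "dog fish fish", "cat fish"]
weights = np.array([3, 1, 2])

cv = CountVectorizer()
X = cv.fit_transform(sentences)  # rows: sentences, columns: cat, dog, fish

# Weighted approach: scale row i by weights[i] using a sparse diagonal matrix.
W = sp.diags(weights, 0)
X_weighted = W @ X
cooc_weighted = X.T @ X_weighted  # weights applied once, not squared

# Reference approach: physically repeat each sentence weights[i] times.
repeated = [s for s, w in zip(sentences, weights) for _ in range(w)]
X_rep = cv.transform(repeated)
cooc_repeated = X_rep.T @ X_rep

# Both the per-token totals and the co-occurrence matrices should agree.
counts_match = np.array_equal(
    np.asarray(X_weighted.sum(axis=0)), np.asarray(X_rep.sum(axis=0))
)
cooc_match = np.array_equal(cooc_weighted.toarray(), cooc_repeated.toarray())
print(counts_match, cooc_match)
```

    The repeated list is only practical here because the weights are tiny; with weights in the millions, the diagonal multiplication never materializes the extra rows.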