I'm using CountVectorizer to create a sparse matrix representation of a co-occurrence matrix.
I have a list of sentences, and another list (vector) of "weights": the number of times I'd like each sentence's tokens to be counted.
It's possible to create a list with each sentence repeated according to its weight (sketched below), but this is terribly inefficient and unpythonic; some of my weights are in the millions and up.
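For concreteness, the workaround I'm trying to avoid looks roughly like this (the sentences and weights below are just placeholders):

# Naive workaround: physically repeat each sentence according to its weight.
# With weights in the millions this blows up memory and runtime.
sentences = ["the cat sat", "dogs bark loudly"]  # placeholder data
weights = [3, 2_000_000]                         # placeholder weights
repeated = [s for s, w in zip(sentences, weights) for _ in range(w)]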
How can I efficiently tell CountVectorizer to use the weight vector I have?
Since there's no way (that I could find) to apply weights to each sentence supplied to CountVectorizer, you can instead multiply the resulting sparse matrix by the weights:
import scipy.sparse
from sklearn.feature_extraction.text import CountVectorizer

# space_splitter: the custom whitespace tokenizer used for these sentences
cv = CountVectorizer(lowercase=False, min_df=0.001, tokenizer=space_splitter)
X = cv.fit_transform(all_strings)

# Multiply the resulting sparse matrix by the weight (count) of each sentence:
# a sparse diagonal matrix of the per-sentence weights (df.weight) scales each row of X.
counts = scipy.sparse.diags(df.weight, 0)
X = (X.T * counts).T
Xc = X.T * X  # co-occurrence matrix of the weighted counts
Note that the matrix you multiply by has to be sparse, and the weights need to be on its diagonal.
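As a quick sanity check (with made-up toy sentences and weights, not the original data), you can confirm that the diagonal trick reproduces the per-token totals you would get by actually repeating each sentence:

import scipy.sparse
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["cat sat cat", "sat dog"]  # toy sentences
weights = [3, 2]                        # toy weights

cv = CountVectorizer()
X = cv.fit_transform(sentences)
Xw = (X.T * scipy.sparse.diags(weights, 0)).T  # weighted counts via the diagonal trick

# Ground truth: physically repeat each sentence by its weight
repeated = [s for s, w in zip(sentences, weights) for _ in range(w)]
Xr = CountVectorizer(vocabulary=cv.vocabulary_).fit_transform(repeated)

# Total per-token counts agree with the repeated version
assert (Xw.sum(axis=0) == Xr.sum(axis=0)).all()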