How to apply weights to sentences in CountVectorizer (count each sentence's tokens several times)

I'm using CountVectorizer to create a sparse matrix representation of a co-occurrence matrix.

I have a list of sentences and a second list (vector) of "weights": the number of times I'd like each sentence's tokens to be counted.

It's possible to create a list with each sentence repeated many times according to its relevant weight, but this is terribly inefficient and un-pythonic. Some of my weights are in the millions and up.

How can I efficiently tell CountVectorizer to use the weight vector I have?

Solution

I could not find a way to pass per-sentence weights to CountVectorizer directly, but you can get the same effect by multiplying the resulting sparse matrix by the weights.

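import scipy.sparse
from sklearn.feature_extraction.text import CountVectorizer

# space_splitter (tokenizer), all_strings (sentences) and df.weight (weights) come from your own code.
cv = CountVectorizer(lowercase=False, min_df=0.001, tokenizer=space_splitter)
X = cv.fit_transform(all_strings)

# Multiply the resulting sparse matrix by the weight (count) of each sentence.
counts = scipy.sparse.diags(df.weight.to_numpy(), 0)
X_weighted = counts @ X  # scales row i (sentence i) by its weight

# Create the co-occurrence matrix. Use the unweighted X on one side so each
# weight is applied once; X_weighted.T @ X_weighted would square the weights.
Xc = X.T @ X_weighted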

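Note that the matrix you multiply by should be a sparse matrix with the weights on its diagonal. Keeping it sparse keeps the product sparse and avoids building a dense n-by-n matrix, where n is the number of sentences.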