I have ~100,000 lists of strings of the form:
['the: 652', 'of: 216', 'in: 168', 'to: 159', 'is: 145']
etc.
which essentially makes up my corpus. Each list contains the words from a document and their word counts.
How can I put this corpus into a form that I can feed into CountVectorizer?
Is there a quicker way than turning each list into a string containing 'the' 652 times, 'of' 216 times, etc.?
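For reference, the brute-force approach I'd like to avoid would be something like this (a rough sketch that rebuilds every document as plain text just so CountVectorizer can re-count it):
from sklearn.feature_extraction.text import CountVectorizer
doc = ['the: 652', 'of: 216', 'in: 168', 'to: 159', 'is: 145']
# Expand 'the: 652' into 'the the the ...' (652 copies) and join everything
# into one long string per document
text = ' '.join(' '.join([term] * int(count))
                for term, count in (item.split(': ') for item in doc))
X = CountVectorizer().fit_transform([text])  # re-counts what was already counted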
Assuming that what you're trying to achieve is a vectorized corpus in sparse matrix format, along with a trained vectorizer, you can simulate the vectorization process without repeating the data:
from scipy.sparse import lil_matrix
from sklearn.feature_extraction.text import CountVectorizer
corpus = [['the: 652', 'of: 216', 'in: 168', 'to: 159', 'is: 145'],
          ['king: 20', 'of: 16', 'the: 400', 'jungle: 110']]
# Prepare a fixed vocabulary for the vectorizer; building it from a set
# means the column order is arbitrary
vocabulary = {item.split(':')[0] for document in corpus for item in document}
indexed_vocabulary = {term: index for index, term in enumerate(vocabulary)}
vectorizer = CountVectorizer(vocabulary=indexed_vocabulary)
# Vectorize the corpus by writing the counts straight into the matrix:
# for each document, X.data holds the counts and X.rows holds the matching
# column indices, so the two lists must stay aligned item for item
X = lil_matrix((len(corpus), len(vocabulary)))
X.data = [[int(item.split(':')[1]) for item in document] for document in corpus]
X.rows = [[vectorizer.vocabulary[item.split(':')[0]] for item in document]
          for document in corpus]
# Convert the matrix to CSR format to be compatible with vectorizer.transform output
X = X.tocsr()
In this example, the result in dense form (X.toarray()) will be:
[[ 168.  216.    0.  159.  652.  145.    0.]
 [   0.   16.  110.    0.  400.    0.   20.]]
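To check which term each column corresponds to, you can sort the vocabulary by column index (a small sketch reusing indexed_vocabulary from above):
# Column order of X; iterating a set is not deterministic, so this
# order may differ between runs
columns = sorted(indexed_vocabulary, key=indexed_vocabulary.get)
print(columns)  # ['in', 'of', 'jungle', 'to', 'the', 'is', 'king'] for the output above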
The trained vectorizer can then vectorize further documents directly:
vectorizer.transform(['jungle kid is programming', 'the jungle machine learning jungle'])
Which yields, again in dense form:
[[0 0 1 0 0 1 0]
 [0 0 2 0 1 0 0]]
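At the scale of ~100,000 documents, it may also be worth skipping the lil intermediate and assembling the CSR matrix from COO triplets, which scipy builds efficiently in one pass. This is a sketch of that variation under the same assumptions, not what the code above does:
import numpy as np
from scipy.sparse import coo_matrix
rows, cols, vals = [], [], []
for i, document in enumerate(corpus):
    for item in document:
        term, count = item.split(':')
        rows.append(i)
        cols.append(indexed_vocabulary[term])
        vals.append(int(count))  # int() tolerates the leading space after ':'
# coo_matrix sums duplicate (row, col) entries and converts cheaply to CSR
X = coo_matrix((vals, (rows, cols)),
               shape=(len(corpus), len(indexed_vocabulary)),
               dtype=np.int64).tocsr()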