I have a list of lists of strings (essentially a corpus), and I'd like to convert it to a matrix where each row is a document in the corpus and the columns are the corpus's vocabulary.
I can do this with CountVectorizer, but it would require quite a lot of memory, as I would need to convert each list into a string that CountVectorizer would in turn tokenize.
I think it's possible to do it with pandas alone, but I'm not sure how.
Example:
corpus = [['a', 'b', 'c'],['a', 'a'],['b', 'c', 'c']]
Expected result:
| a | b | c |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 0 | 0 |
| 0 | 1 | 2 |
I would combine collections.Counter and the DataFrame constructor:
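from collections import Counter
import pandas as pd

corpus = [['a', 'b', 'c'], ['a', 'a'], ['b', 'c', 'c']]
# One Counter per document; tokens missing from a document come out as NaN, so fill with 0 and cast back to int
df = pd.DataFrame(map(Counter, corpus)).fillna(0).astype(int)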
Output:
   a  b  c
0  1  1  1
1  2  0  0
2  0  1  2
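To make it clearer what this approach computes, here is a minimal plain-Python sketch of the same idea without pandas: one Counter per document, then one row per document over a shared vocabulary. The names counts, vocab and matrix are mine, purely for illustration, and the vocabulary order used here (order of first appearance) is an assumption of this sketch, not something the DataFrame version guarantees.

```python
from collections import Counter

corpus = [['a', 'b', 'c'], ['a', 'a'], ['b', 'c', 'c']]

# One Counter per document: maps each token to its count in that document.
counts = [Counter(doc) for doc in corpus]

# Vocabulary in order of first appearance (dict keys keep insertion order).
vocab = list(dict.fromkeys(tok for doc in corpus for tok in doc))

# One row per document; a Counter returns 0 for a token it has never seen,
# which plays the role of the fill-with-zero step.
matrix = [[c[tok] for tok in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```

Since each document is consumed as a list of tokens, nothing has to be joined into a string and tokenized again, which was the memory concern in the question.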