I have 4 corpora:
C1 = ['hello','good','good','desk']
C2 = ['nice','good','desk','paper']
C3 = ['red','blue','green']
C4 = ['good']
I want to define a list of words and, for each word, get the number of occurrences per corpus. So if
l= ['good','blue']
I will get
res_df =
word  C1  C2  C3  C4
good   2   1   0   1
blue   0   0   1   0
My corpora are very large, so I am looking for an efficient way. What is the best way to do this?
Thanks
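One idea is to filter each corpus by the word list converted to a set, count the remaining values with Counter, and finally pass the result to the DataFrame constructor, filling missing values with 0 and converting to integers: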
from collections import Counter
import pandas as pd

d = {'C1':C1, 'C2':C2, 'C3':C3, 'C4':C4}
# set gives fast membership tests while filtering
s = set(l)
df = (pd.DataFrame({k:Counter([y for y in v if y in s]) for k, v in d.items()})
        .fillna(0)
        .astype(int))
print(df)
      C1  C2  C3  C4
good   2   1   0   1
blue   0   0   1   0
If the list may contain words that do not occur in any corpus, add reindex so they still get a row of zeros:
from collections import Counter
import pandas as pd

l = ['good','blue','non']
d = {'C1':C1, 'C2':C2, 'C3':C3, 'C4':C4}
s = set(l)
df = (pd.DataFrame({k:Counter([y for y in v if y in s]) for k, v in d.items()})
        .fillna(0)
        .astype(int)
        # add rows for words found in no corpus, in the order of l
        .reindex(l, fill_value=0))
print(df)
      C1  C2  C3  C4
good   2   1   0   1
blue   0   0   1   0
non    0   0   0   0
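For large corpora, a possible variant of the same idea is sketched below. It feeds a generator to Counter instead of building a filtered list, and it reads the counts back in the order of the word list, so fillna, astype and reindex are not needed (a Counter returns 0 for a missing key). The names count_words, corpora and wanted are hypothetical, introduced only for this sketch, and I have not benchmarked it against the version above.

```python
from collections import Counter

import pandas as pd


def count_words(corpora, words):
    """Return a table with one row per word and one column per corpus."""
    wanted = set(words)  # set gives fast membership tests
    data = {}
    for name, tokens in corpora.items():
        # The generator avoids materialising a filtered copy of the corpus.
        counts = Counter(tok for tok in tokens if tok in wanted)
        # Counter returns 0 for words that never occurred.
        data[name] = [counts[w] for w in words]
    return pd.DataFrame(data, index=words)


corpora = {
    'C1': ['hello', 'good', 'good', 'desk'],
    'C2': ['nice', 'good', 'desk', 'paper'],
    'C3': ['red', 'blue', 'green'],
    'C4': ['good'],
}
res_df = count_words(corpora, ['good', 'blue', 'non'])
print(res_df)
```

Each corpus is still scanned once, so the running time is linear in the total number of tokens; only the intermediate lists and the NaN handling are avoided.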