
Python get num occurrences of elements in each of several lists


I have 4 corpora:

C1 = ['hello','good','good','desk']
C2 = ['nice','good','desk','paper']
C3 = ['red','blue','green']
C4 = ['good']

I want to define a list of words and, for each word, get the number of occurrences per corpus. So if

l= ['good','blue']

I will get

res_df =  word  C1  C2  C3  C4
          good   2   1   0   1
          blue   0   0   1   0

My corpora are very large, so I am looking for an efficient way to do this. What is the best approach?

Thanks


Solution

  • One idea is to filter each corpus by the word list converted to a set, count the matches with Counter, and pass the result to the DataFrame constructor, filling missing values with 0 and casting to integers:

    from collections import Counter
    import pandas as pd
    
    d = {'C1':C1, 'C2':C2, 'C3':C3, 'C4':C4}
    
    s = set(l)     
    
    df = (pd.DataFrame({k:Counter([y for y in v if y in s]) for k, v in d.items()})
            .fillna(0).astype(int))
    print (df)
          C1  C2  C3  C4
    good   2   1   0   1
    blue   0   0   1   0
    

    If the list may contain words that do not occur in any corpus, reindex by the list so those words appear as rows of zeros:

    from collections import Counter
    import pandas as pd
    
    l= ['good','blue','non']
    
    d = {'C1':C1, 'C2':C2, 'C3':C3, 'C4':C4}
    
    s = set(l)     
    
    df = (pd.DataFrame({k:Counter([y for y in v if y in s]) for k, v in d.items()})
            .fillna(0)
            .astype(int)
            .reindex(l, fill_value=0))
    print (df)
        
          C1  C2  C3  C4
    good   2   1   0   1
    blue   0   0   1   0
    non    0   0   0   0
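
Since the corpora are described as very large, here is a minimal sketch of the same idea wrapped in a function. It feeds Counter a generator expression, so no filtered copy of each corpus is built in memory. The name `count_words` and its parameters are hypothetical and are not part of pandas or of the answer above.

```python
from collections import Counter

import pandas as pd


def count_words(words, corpora):
    """Count how often each word in `words` occurs in each corpus.

    `corpora` maps a column label to an iterable of tokens.
    (Hypothetical helper, written for illustration only.)
    """
    wanted = set(words)  # set membership tests are O(1) on average
    counts = {
        # Generator expression: tokens are filtered and counted one at a time
        name: Counter(tok for tok in tokens if tok in wanted)
        for name, tokens in corpora.items()
    }
    # A Counter returns 0 for a missing key, so no fillna/reindex is needed
    return pd.DataFrame(
        {name: [c[w] for w in words] for name, c in counts.items()},
        index=words,
    )


C1 = ['hello', 'good', 'good', 'desk']
C2 = ['nice', 'good', 'desk', 'paper']
C3 = ['red', 'blue', 'green']
C4 = ['good']

res_df = count_words(['good', 'blue', 'non'],
                     {'C1': C1, 'C2': C2, 'C3': C3, 'C4': C4})
print(res_df)
```

Because each corpus is consumed as a plain iterable, the values of `corpora` could also be generators that read tokens from disk, assuming each one only needs to be traversed once.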