python pandas frequency

Find common elements in series of lists


I have a pandas Series in which each value is a list of string tokens. I want to find the tokens that occur across the lists, along with their counts (not just the unique tokens: every token together with its total count across the whole Series). What I am currently doing is building a dictionary from the Series and counting the frequency of each term:

ham_tokens = {}
for l in df_ham.tokens:
    for t in l:
        if ham_tokens.get(t):
            ham_tokens[t]+=1
        else:
            ham_tokens[t]=1

Here is a snapshot of my data:

0  [we, have, difficulties, delivering, your, EMOTION, no, due, to, unpaid, shipping, freight, htpps, cuidaragora, php]
1  [costcoreward, your, EMOTION, cash, back, has, been, remunerated, sorry, for, the, delay, click]
2  [your, civil, verdict, has, been, finaiized, get, your, payment, by, URLBRAND, juristalawll, bch]
3  [need, quick, cash, get, up, to, cash, loan, in, minutes, no, credit, needed, same, day, funding, apply, now, reply, stop, to, remove]
4  [authmsg, BRAND, verification, is, dont, share, to, anyone, else, EMOTION, id, account, cannot, access, rightnow, bit, ly]

What I need is a pandas method, or any other efficient (loop-less) approach, that can handle this problem.


Solution

  • As @Mustafa Aydın suggests, you can use .explode() to turn the Series of lists into a Series with one word per row. Calling .value_counts() on that result counts the number of occurrences of each word. Finally, dict() converts the counts into a dictionary:

    dict(df_series.explode().value_counts())
    

    For example:

    >>> df_series
    0       [a, b, c]
    1       [a, c, d]
    2    [q, c, b, c]
    Name: 0, dtype: object
    
    >>> df_series.explode()
    0    a
    0    b
    0    c
    1    a
    1    c
    1    d
    2    q
    2    c
    2    b
    2    c
    Name: 0, dtype: object
    
    >>> df_series.explode().value_counts()
    c    4
    a    2
    b    2
    d    1
    q    1
    Name: 0, dtype: int64
    
    >>> dict(df_series.explode().value_counts())
    {'c': 4, 'a': 2, 'b': 2, 'd': 1, 'q': 1}