Search code examples
pythonpandasword-frequency

Getting top 100 words with highest document frequency in a pandas series


Suppose I have a pandas series like this:

0  "sun moon earth moon"
1  "sun saturn mercury saturn"
2  "sun earth mars"
3  "sun earth saturn sun saturn"

I want to get the top 3 words with the highest row ("document") frequency irrespective of the frequency within a single row ("document").

For overall frequency I can just collect all the words from all rows in a string, do a split, convert back to series and use value_counts. In that case, the top 3 frequencies would be:

1. sun:    5
2. saturn: 4
3. earth:  3

But the document frequencies, i.e. the number of rows in which a word occurs, would be

1. sun:    4
2. earth:  3
3. saturn: 2

A way I can think of off the top of my head is to apply a lambda function to the series, splitting each string, making a set out of it, then combining all words into a single set, making a series out of that and then using value_counts. Is there a more efficient way of achieving the same thing?


Solution

  • Because performance is important use Counter:

    from collections import Counter
    
    a = Counter([y for x in s for y in x.split()]).most_common(3)
    print (a)
    [('sun', 5), ('saturn', 4), ('earth', 3)]
    
    b = Counter([y for x in s for y in set(x.split())]).most_common(3)
    print (b)
    [('sun', 4), ('earth', 3), ('saturn', 2)]
    
    df1 = pd.DataFrame(a, columns=['val','count'])
    #df2 = pd.DataFrame(b, columns=['val','count'])
    print (df1)
          val  count
    0     sun      5
    1  saturn      4
    2   earth      3
    

    Pandas alternatives:

    a = s.str.split(expand=True).stack().value_counts().head(3)
    print (a)
    sun       5
    saturn    4
    earth     3
    dtype: int64
    
    b = (s.str.split(expand=True)
          .stack()
          .reset_index(name='val')
          .drop_duplicates(['val', 'level_0'])['val']
          .value_counts()
          .head(3))
    print (b)
    sun       4
    earth     3
    saturn    2
    Name: val, dtype: int64