Tags: python, pandas, word-frequency

How to calculate most frequently occurring words in pandas dataframe column by year?


I have a pandas dataframe that contains a column 'reviews' and a column 'year'. I would like to view the top 100 most frequently occurring words in the reviews column, but filtered by year. So, I want to know the top 100 from 2002, 2003, 2004, and so on, all the way up to 2017.

import pandas as pd
from nltk.corpus import stopwords

df = pd.read_csv('./reviews.csv')

# note: the stop words are loaded here but never actually filtered out
stop = stopwords.words('english')

commonwords = pd.Series(' '.join(df['reviews']).lower().split()).value_counts()[:100]

print(commonwords)

commonwords.to_csv('commonwords.csv')

The above code works, but it only gives me the top 100 most frequently occurring words across all years, not per year.


Solution

  • You can use:

    df = pd.DataFrame({'reviews':['He writer in me great great me',
                            'great ambience the coffee was great',
                            'great coffee'],
                       'year':[2002,2004,2004]})
    print (df)
    
                                   reviews  year
    0       He writer in me great great me  2002
    1  great ambience the coffee was great  2004
    2                         great coffee  2004
    
    # change N to 100 to get the top 100 on real data
    N = 3
    df1 =  (df.set_index('year')['reviews']
              .str.lower()
              .str.split(expand=True)
              .stack()
              .groupby(level=0)
              .value_counts()
              .groupby(level=0)
              .head(N)
              .rename_axis(['year','words'])
              .reset_index(name='count'))
    
    print (df1)
       year     words  count
    0  2002     great      2
    1  2002        me      2
    2  2002        he      1
    3  2004     great      3
    4  2004    coffee      2
    5  2004  ambience      1
    

    Explanation:

    1. Convert the values to lowercase with Series.str.lower, then split each review into its own DataFrame column with Series.str.split(expand=True)
    2. Reshape with DataFrame.stack into a MultiIndex Series of single words, keyed by year
    3. Count the words per year group with SeriesGroupBy.value_counts; the counts come back sorted in descending order within each group
    4. Keep the top N words per group with GroupBy.head
    5. Clean up the output with Series.rename_axis and Series.reset_index
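Since the question also imports NLTK's stop words, the same pipeline can drop them before counting by filtering the stacked Series with `Series.isin`. A minimal sketch of that combination is below; it uses a small hand-rolled stop-word set in place of `stopwords.words('english')` so it runs without an NLTK download:

```python
import pandas as pd

# stand-in for nltk's stopwords.words('english'), so no download is needed
stop = {'he', 'in', 'me', 'the', 'was'}

df = pd.DataFrame({'reviews': ['He writer in me great great me',
                               'great ambience the coffee was great',
                               'great coffee'],
                   'year': [2002, 2004, 2004]})

# lowercase, split into words, and stack into a year-indexed Series of words
words = (df.set_index('year')['reviews']
           .str.lower()
           .str.split(expand=True)
           .stack())

# drop stop words before counting, then keep the top N per year
N = 3
df1 = (words[~words.isin(stop)]
         .groupby(level=0)
         .value_counts()
         .groupby(level=0)
         .head(N)
         .rename_axis(['year', 'words'])
         .reset_index(name='count'))

print(df1)
```

With the sample data this keeps `great`/`writer` for 2002 and `great`/`coffee`/`ambience` for 2004, since `he`, `in`, `me`, `the`, and `was` are filtered out before `value_counts` runs.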