I have a pandas dataframe that contains a column 'reviews' and a column 'year'. I would like to view the top 100 most frequently occurring words in the reviews column, but filtered by year. So, I want to know the top 100 from 2002, 2003, 2004, and so on, all the way up to 2017.
import pandas as pd
from nltk.corpus import stopwords

df = pd.read_csv('./reviews.csv')
stop = stopwords.words('english')          # stopword list (not applied below)

# count every word across all reviews and keep the 100 most frequent
commonwords = pd.Series(' '.join(df['reviews']).lower().split()).value_counts()[:100]
print(commonwords)
commonwords.to_csv('commonwords.csv')
The above code works, but it only gives me the top 100 most frequently occurring words across all years combined, not the top 100 per year.
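Side note: I build the stop list above but never apply it yet; I assume I could drop those stopwords before counting with something like:

words = pd.Series(' '.join(df['reviews']).lower().split())
commonwords = words[~words.isin(stop)].value_counts()[:100]

But that would still count across all years, not per year.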
You can use:
df = pd.DataFrame({'reviews':['He writer in me great great me',
'great ambience the coffee was great',
'great coffee'],
'year':[2002,2004,2004]})
print (df)
                               reviews  year
0       He writer in me great great me  2002
1  great ambience the coffee was great  2004
2                         great coffee  2004
# change N to 100 to get the top 100 on the real data
N = 3
df1 = (df.set_index('year')['reviews']
.str.lower()
.str.split(expand=True)
.stack()
.groupby(level=0)
.value_counts()
.groupby(level=0)
.head(N)
.rename_axis(('year','words'))
.reset_index(name='count'))
print (df1)
   year     words  count
0  2002     great      2
1  2002        me      2
2  2002        he      1
3  2004     great      3
4  2004    coffee      2
5  2004  ambience      1
Explanation:

Series.str.lower lowercases the reviews.
Series.str.split with expand=True splits each review into a DataFrame with one word per column.
DataFrame.stack reshapes that DataFrame into a MultiIndex Series of single words.
SeriesGroupBy.value_counts counts the words per year; within each year the counts are sorted in descending order.
GroupBy.head therefore keeps only the N most frequent words per year.
Series.rename_axis names the index levels ('year', 'words') and Series.reset_index moves them into columns, with the counts in a column named 'count'.
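If it helps to follow the intermediate shapes, the same chain can be unrolled step by step; this is just a sketch reusing the sample df and N from above, with the intermediate variable names chosen only for illustration:

lowered = df.set_index('year')['reviews'].str.lower()   # Series of reviews indexed by year
words = lowered.str.split(expand=True)                   # DataFrame with one word per column
stacked = words.stack()                                  # MultiIndex Series of single words
counts = stacked.groupby(level=0).value_counts()         # per-year word counts, sorted descending
top = counts.groupby(level=0).head(N)                    # keep the N most frequent words per year
df1 = top.rename_axis(('year', 'words')).reset_index(name='count')
print(df1)

From there the whole result can be saved in one go, e.g. df1.to_csv('commonwords_by_year.csv', index=False) (the filename is only an example), mirroring the CSV step in the question.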