I have a dataframe.
I am struggling to show the top 10 words in a bar chart for all tweets, real tweets, and fake tweets. Any suggestions?
Divide all texts into words, count the frequencies, select the 10 most frequent ones and plot them. I thought something like this could work but as a novice I'm unsure how to implement this.
Use pandas.Series.explode to expand the values in each list into separate rows.
Then .groupby the values in the column, aggregate with .count, and sort with .sort_values.
Use pandas.DataFrame.plot.bar to plot the words.
import pandas as pd
import matplotlib.pyplot as plt
# test dataframe
df = pd.DataFrame({'lemmatized': [['se', 'acuerdan', 'de', 'la', 'pelicula el', 'dia'], ['milenagimon', 'miren', 'sandy', 'en', 'ny', 'tremenda'], ['se', 'acuerdan', 'de']]})
# display(df)
                                      lemmatized
0       [se, acuerdan, de, la, pelicula el, dia]
1  [milenagimon, miren, sandy, en, ny, tremenda]
2                             [se, acuerdan, de]
# use explode to expand the lists into separate rows
dfe = df.lemmatized.explode().to_frame().reset_index(drop=True)
# groupby the values in the column, get the count and sort
dfg = (
    dfe.groupby('lemmatized').lemmatized.count()
    .reset_index(name='count')
    .sort_values(['count'], ascending=False)
    .head(10).reset_index(drop=True)
)
# display(dfg)
    lemmatized  count
0     acuerdan      2
1           de      2
2           se      2
3          dia      1
4           en      1
5           la      1
6  milenagimon      1
7        miren      1
8           ny      1
9  pelicula el      1
# plot the dataframe
dfg.plot.bar(x='lemmatized')
Alternatively, use .value_counts instead of .groupby.
# use value_counts and plot the series
dfe.lemmatized.value_counts().head(10).plot.bar()
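If the word lists are not in a dataframe yet, the same "split, count, take the 10 most frequent" idea from the question can be sketched with the standard library alone. This is only an illustration: the `tweets` list below is the test data from above, and the variable names are my own.

```python
from collections import Counter
from itertools import chain

# same test data as the dataframe above, as plain Python lists
tweets = [
    ['se', 'acuerdan', 'de', 'la', 'pelicula el', 'dia'],
    ['milenagimon', 'miren', 'sandy', 'en', 'ny', 'tremenda'],
    ['se', 'acuerdan', 'de'],
]

# flatten the lists into one stream of words and count each word
counts = Counter(chain.from_iterable(tweets))

# the 10 most frequent words as (word, count) pairs, highest count first
top10 = counts.most_common(10)

# split the pairs into two sequences, ready for a bar chart
words, freqs = zip(*top10)
```

`words` and `freqs` can then be passed to `plt.bar(words, freqs)` to draw the chart.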
Alternatively, use seaborn.countplot.
import seaborn as sns
# plot dfe
sns.countplot(x='lemmatized', data=dfe, order=dfe.lemmatized.value_counts().iloc[:10].index)
plt.xticks(rotation=90)
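To get one chart each for all tweets, real tweets and fake tweets, the same explode and value_counts steps can be applied to each subset. The sketch below assumes the dataframe has a column that marks each tweet as real or fake. The column name `label` and its values `'real'` and `'fake'` are my assumptions, since the question does not show the dataframe, so adjust them to match your data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# test dataframe; 'label' is an assumed column marking each tweet as real or fake
df = pd.DataFrame({
    'lemmatized': [['se', 'acuerdan', 'de', 'la', 'pelicula el', 'dia'],
                   ['milenagimon', 'miren', 'sandy', 'en', 'ny', 'tremenda'],
                   ['se', 'acuerdan', 'de']],
    'label': ['real', 'fake', 'real'],
})

def top_words(frame, n=10):
    """Return the n most frequent words in the 'lemmatized' lists of frame."""
    return frame['lemmatized'].explode().value_counts().head(n)

# one subset per chart: all tweets, then each label
subsets = {
    'all': df,
    'real': df[df['label'] == 'real'],
    'fake': df[df['label'] == 'fake'],
}
top = {name: top_words(sub) for name, sub in subsets.items()}

# draw the three bar charts side by side
fig, axes = plt.subplots(1, len(top), figsize=(15, 4))
for ax, (name, counts) in zip(axes, top.items()):
    counts.plot.bar(ax=ax, title=f'Top words: {name}')
fig.tight_layout()
```

Call `plt.show()` afterwards if you are running this as a script rather than in a notebook.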