python pandas matplotlib twitter

How to plot word frequency, from a column of lists, in a bar chart


I have a dataframe in which one column holds a list of words for each tweet.

I am struggling to show the top 10 words in a bar chart for all tweets, for real tweets, and for fake tweets. Any suggestions?

My idea is to split all texts into words, count the frequencies, select the 10 most frequent words, and plot them. I thought something like this could work, but as a novice I'm unsure how to implement it.


Solution

    • First, use pandas.Series.explode to expand each list so that every word gets its own row.
    • Then .groupby the words, aggregate with .count, and use .sort_values to rank them by frequency.
    • Finally, use pandas.DataFrame.plot.bar to plot the 10 most frequent words.

    import pandas as pd
    import matplotlib.pyplot as plt
    
    # test dataframe
    df = pd.DataFrame({'lemmatized': [['se', 'acuerdan', 'de', 'la', 'pelicula el', 'dia'], ['milenagimon', 'miren', 'sandy', 'en', 'ny', 'tremenda'], ['se', 'acuerdan', 'de']]})
    
    # display(df)
    #                                       lemmatized
    # 0       [se, acuerdan, de, la, pelicula el, dia]
    # 1  [milenagimon, miren, sandy, en, ny, tremenda]
    # 2                             [se, acuerdan, de]
    
    # use explode to expand the lists into separate rows
    dfe = df.lemmatized.explode().to_frame().reset_index(drop=True)
    
    # groupby the values in the column, get the count and sort
    dfg = dfe.groupby('lemmatized').lemmatized.count() \
                                   .reset_index(name='count') \
                                   .sort_values(['count'], ascending=False) \
                                   .head(10).reset_index(drop=True)
    
    # display(dfg)
    #     lemmatized  count
    # 0     acuerdan      2
    # 1           de      2
    # 2           se      2
    # 3          dia      1
    # 4           en      1
    # 5           la      1
    # 6  milenagimon      1
    # 7        miren      1
    # 8           ny      1
    # 9  pelicula el      1
    
    # plot the dataframe
    dfg.plot.bar(x='lemmatized')
    


    Alternative Implementations

    • Use .value_counts instead of .groupby
    # use value_counts and plot the series
    dfe.lemmatized.value_counts().head(10).plot.bar()
    
    • Using seaborn.countplot
    import seaborn as sns
    
    # plot dfe
    sns.countplot(x='lemmatized', data=dfe, order=dfe.lemmatized.value_counts().iloc[:10].index)
    plt.xticks(rotation=90)
    

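    Separate charts for all, real and fake tweets

The question also asks for the top 10 words split by real and fake tweets, which the examples above do not cover. The following is a minimal sketch of one way to do it, not a definitive implementation. It assumes the dataframe has a column that marks each tweet as real or fake. The column name label, its values 'real' and 'fake', and the helper top_words are hypothetical and not taken from the question, so adjust them to the actual data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# test dataframe; the 'label' column and its values are assumptions for illustration
df = pd.DataFrame({
    'lemmatized': [['se', 'acuerdan', 'de', 'la', 'pelicula el', 'dia'],
                   ['milenagimon', 'miren', 'sandy', 'en', 'ny', 'tremenda'],
                   ['se', 'acuerdan', 'de']],
    'label': ['real', 'fake', 'real'],
})


def top_words(frame, n=10):
    """Hypothetical helper: return the n most frequent words as a Series."""
    return frame['lemmatized'].explode().value_counts().head(n)


# select the rows for each group of tweets
subsets = {
    'all': df,
    'real': df[df['label'] == 'real'],
    'fake': df[df['label'] == 'fake'],
}

# draw one bar chart per group, side by side
fig, axes = plt.subplots(1, len(subsets), figsize=(15, 4))
for ax, (name, frame) in zip(axes, subsets.items()):
    top_words(frame).plot.bar(ax=ax, title=f'Top words: {name}')
fig.tight_layout()
```

Filtering the rows before calling explode keeps each chart's counts limited to its own group. Call plt.show() afterwards if the figure does not appear automatically.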