Search code examples
pythonpandasword-frequency

How do I get the most frequent words in a column of text based on the value of another column?


I have a dataset of tweets and the year they were posted. I want to get a count of the most frequently occurring words each year. My dataset looks like this:

year     tweet
2015     my car is blue
2015     mom is making dinner
2016     my hair is red
2016     i love my mom

I only know how to get the most frequently occurring words for the entire dataset:

pd.Series(' '.join(df['tweets']).split()).value_counts()

Which would give me this:

my      3
is      3
mom     2
car     1
blue    1
making  1
dinner  1
hair    1
red     1
i       1
love    1

So how would I get something like this?

2015

is      2
my      1
car     1
blue    1
mom     1
making  1
dinner  1

2016

my      2
hair    1
is      1
red     1
i       1
love    1
mom     1

Solution

  • I'd do something like this:

    counts = df.set_index('year')['tweet'].str.split().explode().groupby(level=0).apply(pd.value_counts)
    

    Output:

    >>> counts
    year        
    2015  is        2
          my        1
          car       1
          blue      1
          mom       1
          making    1
          dinner    1
    2016  my        2
          hair      1
          is        1
          red       1
          i         1
          love      1
          mom       1
    Name: tweet, dtype: int6
    

    To get the top, say, 5 items per year:

    df.set_index('year')['tweet'].str.split().explode().groupby(level=0).apply(lambda x: x.value_counts().head(5))