Search code examples
pythonpandasdataframegroup-bydistinct-values

Group by and Count distinct words in Pandas DataFrame


By year and name, I am hoping to count the occurrence of words in a dataframe from imported from Excel which results will also be exported to Excel.

This is the sample code:

source = pd.DataFrame({'Name' : ['John', 'Mike', 'John','John'], 
                  'Year' : ['1999', '2000', '2000','2000'],
                  'Message' : [

'I Love You','Will Remember You','Love','I Love You]})

Excepted results are the following in a dataframe. Any ideas?

Year Name Message Count
1999 John I 1
1999 John love 1
1999 John you 1

2000 Mike Will 1 
2000 Mike Remember 1
2000 Mike You 1 
2000 John Love 2
2000 John I 1
2000 John You 1

Solution

  • I think you can first split column Message, create Serie and add it to original source. Last groupby with size:

    #split column Message to new df, create Serie by stack
    s = (source.Message.str.split(expand=True).stack())
    #remove multiindex
    s.index = s.index.droplevel(-1)
    s.name= 'Message'
    print(s)
    0           I
    0        Love
    0         You
    1        Will
    1    Remember
    1         You
    2        Love
    3           I
    3        Love
    3         You
    Name: Message, dtype: object
    
    #remove old column Message
    source = source.drop(['Message'], axis=1)
    #join Serie s to df source
    df = (source.join(s))
    
    #aggregate size
    print (df.groupby(['Year', 'Name', 'Message']).size().reset_index(name='count'))
       Year  Name   Message  count
    0  1999  John         I      1
    1  1999  John      Love      1
    2  1999  John       You      1
    3  2000  John         I      1
    4  2000  John      Love      2
    5  2000  John       You      1
    6  2000  Mike  Remember      1
    7  2000  Mike      Will      1
    8  2000  Mike       You      1