Search code examples
pandasdataframemultiple-columns

Pandas value_counts with added column


I got data frame as follow:

sentence   userid  topic
hello        1001  smalltalk
hi           1002  smalltalk
hello        1002  smalltalk
how are you? 1003  question
hello        1004  smalltalk
what is new? 1005  question
hi           1006  smalltalk
hello        1007  smalltalk

With the help of Pandas value_counts

Input:

df['sentence'].value_counts()

Output:

hello 4
hi 2
how are you? 1
what is new? 1

What I really wish to get is the same value count with the a specific column added next to it:

hello 4 smalltalk
hi 2 smalltalk
how are you? 1 question
what is new? 1 question

Solution

  • IIUC, OP needs to keep interested results of df[['sentence', 'topic']].value_counts() in the form of DataFrame using module for further manipulation\visulaizataions. So this can be achieved by groupby() and aggregating the counts of interested multiple variables\columns under a new column count within DataFrame:

    import pandas as pd
    
    #Generate dataframe
    df = pd.DataFrame({'userid':    [1001, 1002, 1002, 1003, 1004, 1005, 1006, 1007],    
                        'sentence': ['hello', "hi", 'hello', "how are you?", 'hello', "what is new?", "hi", 'hello'],    
                        'topic':    ["smalltalk", "smalltalk", "smalltalk", "question", "smalltalk", "question", "smalltalk", "smalltalk"],
                        })
    
    #Aggregate counts with respect to interested columns in df
    df2 = df.groupby(["sentence","topic"])["topic"].agg(["count"]) \
            .reset_index() \
            .drop_duplicates() #remove duplicates
    
    print(df2) 
    #       sentence      topic  count
    #0         hello  smalltalk      4
    #1            hi  smalltalk      2
    #2  how are you?   question      1
    #3  what is new?   question      1