I got data frame as follow:
sentence userid topic
hello 1001 smalltalk
hi 1002 smalltalk
hello 1002 smalltalk
how are you? 1003 question
hello 1004 smalltalk
what is new? 1005 question
hi 1006 smalltalk
hello 1007 smalltalk
With the help of Pandas value_counts
Input:
df['sentence'].value_counts()
Output:
hello 4
hi 2
how are you? 1
what is new? 1
What I really wish to get is the same value count with the a specific column added next to it:
hello 4 smalltalk
hi 2 smalltalk
how are you? 1 question
what is new? 1 question
IIUC, OP needs to keep interested results of df[['sentence', 'topic']].value_counts()
in the form of DataFrame using pandas module for further manipulation\visulaizataions. So this can be achieved by groupby()
and aggregating the counts of interested multiple variables\columns under a new column count
within DataFrame:
import pandas as pd
#Generate dataframe
df = pd.DataFrame({'userid': [1001, 1002, 1002, 1003, 1004, 1005, 1006, 1007],
'sentence': ['hello', "hi", 'hello', "how are you?", 'hello', "what is new?", "hi", 'hello'],
'topic': ["smalltalk", "smalltalk", "smalltalk", "question", "smalltalk", "question", "smalltalk", "smalltalk"],
})
#Aggregate counts with respect to interested columns in df
df2 = df.groupby(["sentence","topic"])["topic"].agg(["count"]) \
.reset_index() \
.drop_duplicates() #remove duplicates
print(df2)
# sentence topic count
#0 hello smalltalk 4
#1 hi smalltalk 2
#2 how are you? question 1
#3 what is new? question 1