I am trying to get the top 10 most frequent words per class in my dataset. I have the following Python code, but I do not understand the output: why does it occur, and how can it be corrected?
Below is the dataset I am using (df):
   User            Post                                               Label
0  Nicholas Wyman  Exploring in this months Talent Management HR...   Recruitment
1  Nicholas Wyman  I count myself fortunate to have spent time wi...  Career
2  Nicholas Wyman  This years National Apprenticeship Week comes ...  Recruitment
3  Nicholas Wyman  How will your company tap into workers as a co...  Wellbeing
4  Nicholas Wyman  The momentum for Modern Apprenticeships is bui...  Recruitment
This is the code I am using:
#Import dataset
df = pd.read_csv("Folds1345.csv", engine='python',encoding='latin-1')
#Get classes
classes = df['Label'].unique()
classes = classes.tolist()
#Check each class and produce top 10 words
for i in classes:
    print(i)
    df2=df.loc[df['Label'] == i, 'Post']
    df2 = str(remove_stopwords(df['Post']))
    from collections import Counter
    Frequent = Counter(" ".join(df2).split()).most_common(10)
    print(Frequent)
And this is the output:
Recruitment
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Career
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Wellbeing
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Rewards
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Technology
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Learning
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
HR System
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Inclusion
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Diversity
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
It seems to be looking at individual letters rather than words and searching the entire dataset rather than just the posts with the chosen label, but I cannot work out why.
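from collections import Counter

import pandas as pd
# Assumption: remove_stopwords is the gensim function, which expects a single string
from gensim.parsing.preprocessing import remove_stopwords

# Import dataset
df = pd.read_csv("Folds1345.csv", engine='python', encoding='latin-1')

# Get classes
classes = df['Label'].unique().tolist()

# Check each class and produce top 10 words
for i in classes:
    print(i)
    # Keep only the posts that belong to the current class
    df2 = df.loc[df['Label'] == i, 'Post']
    # Remove stopwords post by post, so df2 stays a Series of strings
    df2 = df2.apply(lambda x: remove_stopwords(x))
    list_sentences = df2.to_list()
    # Join the posts into one string, then split on whitespace to get words
    list_words = ' '.join(str(s) for s in list_sentences).split()
    Frequent = Counter(list_words).most_common(10)
    print(Frequent)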
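EDIT: Your df2 is first a pandas Series and then a string. Calling " ".join() on a string puts a space between every character, so Counter ends up counting characters rather than words. The filtered df2 is also overwritten straight away with a value built from df['Post'], i.e. the whole dataset, which is why every label gives the same counts. I am not sure which remove_stopwords function you are using; I guess it is the one from gensim. I adapted the code above accordingly.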
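To see why single characters get counted, here is a minimal sketch that compares joining a string with joining a list of strings. The sample text is made up for illustration and is not taken from your dataset.

```python
from collections import Counter

text = "the cat and the dog"        # one string, like the result of str(...)
posts = ["the cat", "the dog"]      # a list of strings, one per post

# Joining a string iterates over its characters,
# so after split() every character is treated as a "word".
chars = " ".join(text).split()

# Joining a list of strings keeps the words intact.
words = " ".join(posts).split()

print(Counter(chars).most_common(3))
print(Counter(words).most_common(1))
```

The first print shows counts of letters, the second shows counts of whole words, which is the behaviour you want.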
EDIT2: This time it should work.
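If you prefer to avoid the explicit loop over labels, the same idea can probably be written with groupby. This is only a sketch under assumptions: the small DataFrame, the STOPWORDS set and the top_words helper are all hypothetical stand-ins, not your data or the gensim function.

```python
from collections import Counter

import pandas as pd

# Tiny made-up frame standing in for the real CSV (illustrative data only)
df = pd.DataFrame({
    "Post": ["hiring great people", "hiring apprentices now", "career growth matters"],
    "Label": ["Recruitment", "Recruitment", "Career"],
})

# Hypothetical stopword list; swap in whatever stopword removal you actually use
STOPWORDS = {"now", "the", "a"}

def top_words(posts, n=10):
    """Return the n most common non-stopword tokens across a Series of posts."""
    words = [
        w
        for post in posts
        for w in str(post).lower().split()
        if w not in STOPWORDS
    ]
    return Counter(words).most_common(n)

# Iterating over a grouped Series yields (label, Series of posts) pairs
top_by_label = {label: top_words(group) for label, group in df.groupby("Label")["Post"]}

for label, frequent in top_by_label.items():
    print(label)
    print(frequent)
```

Each label only ever sees its own posts here, so the counts cannot leak across classes.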