Tags: python, counter, data-analysis, word-frequency, exploratory-data-analysis

Unexpected output when creating a list of frequent words. How can I get the top 10 most frequent words for a given class?


I am trying to get the top 10 most frequent words per class in my dataset. I have the following Python code, but I do not understand its output, why it occurs, or how to correct it.

Below is the dataset I am using (df):

   User            Post                                               Label
0  Nicholas Wyman  Exploring in this months Talent Management HR...   Recruitment
1  Nicholas Wyman  I count myself fortunate to have spent time wi...  Career
2  Nicholas Wyman  This years National Apprenticeship Week comes ...  Recruitment
3  Nicholas Wyman  How will your company tap into workers as a co...  Wellbeing
4  Nicholas Wyman  The momentum for Modern Apprenticeships is bui...  Recruitment

This is the code I am using:

#Import dataset
df = pd.read_csv("Folds1345.csv", engine='python',encoding='latin-1')

#Get classes
classes = df['Label'].unique()
classes = classes.tolist()

#Check each class and produce top 10 words
for i in classes:
  print(i)
  df2=df.loc[df['Label'] == i, 'Post']
  df2 = str(remove_stopwords(df['Post']))
  from collections import Counter
  Frequent = Counter(" ".join(df2).split()).most_common(10)
  print(Frequent)

And this is the output:

Recruitment
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Career
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Wellbeing
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Rewards
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Technology
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Learning
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
HR System
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Inclusion
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]
Diversity
[("'", 1213698), (',', 606859), ('e', 507474), ('i', 321003), ('a', 311593), ('n', 303956), ('t', 296568), ('s', 290978), ('r', 276802), ('o', 261175)]

It seems to be looking at individual letters rather than words and searching the entire dataset rather than just the posts with the chosen label, but I cannot work out why.


Solution

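  • Filter the posts by label first, then remove stopwords from each post separately:

    from collections import Counter

    import pandas as pd
    # Assumption: remove_stopwords is gensim's; swap in your own if it differs.
    from gensim.parsing.preprocessing import remove_stopwords

    # Import dataset
    df = pd.read_csv("Folds1345.csv", engine='python', encoding='latin-1')

    # Get classes
    classes = df['Label'].unique().tolist()

    # Check each class and produce its top 10 words
    for label in classes:
        print(label)
        # Keep only the posts that carry this label
        df2 = df.loc[df['Label'] == label, 'Post']
        # Remove stopwords post by post, so df2 stays a Series of strings
        df2 = df2.apply(remove_stopwords)
        # Join the posts into one text, then split it on whitespace
        list_words = ' '.join(str(s) for s in df2).split()
        Frequent = Counter(list_words).most_common(10)
        print(Frequent)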
    

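    EDIT: Your df2 is first a pandas Series (the posts filtered by label) and is then overwritten with a string built from the whole df['Post'] column. That discards the label filter, which is why every class prints the same counts. Calling " ".join() on a string also puts a space between every character, which is why single characters are counted instead of words. I am not sure which remove_stopwords function you are using; I guess it is the one from gensim. I adapted the code accordingly.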
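To see the character-versus-word effect in isolation, here is a minimal sketch. The sample strings are made up for illustration and are not taken from your dataset.

```python
from collections import Counter

# Made-up sample data, for illustration only.
text = "talent management"                        # one plain string, like the result of str(...)
posts = ["talent management", "talent week"]      # a list of strings, one per post

# Joining a *string* puts a space between every character,
# so split() hands back single characters.
chars = " ".join(text).split()

# Joining a *list of strings* puts a space between whole posts,
# so split() hands back words.
words = " ".join(posts).split()

char_counts = Counter(chars).most_common(3)
word_counts = Counter(words).most_common(1)

print(char_counts)   # counts of single letters
print(word_counts)   # counts of whole words
```

The first print shows only single letters, much like your output; the second shows whole words.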

    EDIT2: This time it should work.
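If you prefer to avoid the explicit loop over classes, the same idea can be sketched with groupby. This is only one possible variant, not something from your code: the function name top_words_per_label, the tiny STOPWORDS set and the sample frame are all hypothetical, and it lowercases the words, which the loop above does not do.

```python
from collections import Counter

import pandas as pd

# Hypothetical, deliberately tiny stopword set for illustration only.
# Replace it with the list your own remove_stopwords function relies on.
STOPWORDS = {"the", "a", "to", "in", "of", "and", "is", "for"}


def top_words_per_label(df, text_col="Post", label_col="Label", n=10):
    """Return {label: [(word, count), ...]} with the n most frequent words."""
    result = {}
    # groupby yields one (label, Series of posts) pair per class,
    # so each Counter only ever sees the posts of a single label.
    for label, posts in df.groupby(label_col)[text_col]:
        counter = Counter()
        for post in posts.dropna():
            # Lowercase and split on whitespace, then skip stopwords.
            words = str(post).lower().split()
            counter.update(w for w in words if w not in STOPWORDS)
        result[label] = counter.most_common(n)
    return result


# Made-up sample frame with the same column names as in the question.
sample = pd.DataFrame({
    "Post": ["Hiring the best talent", "Talent is hard to find", "Sleep and the mind"],
    "Label": ["Recruitment", "Recruitment", "Wellbeing"],
})

top = top_words_per_label(sample, n=3)
print(top)
```

On your data you would call top_words_per_label(df) and read the result per label. Punctuation is not stripped here, so "talent," and "talent" would be counted separately.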