python pandas stemming

Word frequency with stemming


I have a question about how to sum the frequencies of words that I consider to have the same meaning, so that they are counted as a single word.

For example, I have this dataset:

    Word    Frequency
0   game    52055
1   laura   24953
2   luke    21133
3   story   20739
4   dog     17054
5   like    12792
7   character   8845
9   play    8420
11  characters  8081
12  people  7933
16  good    6496
18  10      6309
19  gameplay    6195
22  revenge 5922
25  bad     5331
26  end     5027
27  feel    4833
28  killed  4779
31  kill    4545
33  graphics    4372
34  time    4272
35  cat     4244
44  great   3466
45  ending  3379
...
50  love    3059
51  never   2965
52  new     2963
53  killing 2955

This is a dataset with two columns: one with words and the other with their frequency in the document. I need the following to be treated as the same word:

  • kill, killing, killed;
  • character and characters;
  • end, ending.

I think this could easily be done with PorterStemmer. However, I would also need to sum the frequencies of the words that get merged.

So, for example,

28  killed  4779
31  kill    4545
53  killing 2955

should be

31 kill 12279

Unfortunately, I could not apply stemming beforehand, as I received the dataset in the form shown above. Could you please give me some advice on how to get this sum?


Solution

  • You can use nltk (df being the input dataframe you've shared):

    from nltk.stem import PorterStemmer

    ps = PorterStemmer()
    # Reduce each word to its Porter stem, e.g. "killed" and "killing" both become "kill"
    df["Stem"] = df["Word"].apply(ps.stem)
    # Sum the frequencies of all words that share a stem
    res = df.groupby("Stem")["Frequency"].sum()
    

    Output (for the sample you shared):

    Stem
    10           6309
    bad          5331
    cat          4244
    charact     16926
    dog         17054
    end          8406
    feel         4833
    game        52055
    gameplay     6195
    good         6496
    graphic      4372
    great        3466
    kill        12279
    laura       24953
    like        12792
    love         3059
    luke        21133
    never        2965
    new          2963
    peopl        7933
    play         8420
    reveng       5922
    stori       20739
    time         4272
    Name: Frequency, dtype: int64
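
Note that Porter stems are not always dictionary words (charact, peopl, stori above). If you would rather label each group with one of the original words, as in the "kill 12279" row you asked for, one possible approach is to keep a representative word per stem. The sketch below is only an illustration on a small subset of your data; picking the shortest original word as the label is an arbitrary assumption, and you could just as well pick the most frequent one.

```python
import pandas as pd
from nltk.stem import PorterStemmer

# Small subset of the data from the question
df = pd.DataFrame({
    "Word": ["killed", "kill", "killing", "character", "characters", "end", "ending"],
    "Frequency": [4779, 4545, 2955, 8845, 8081, 5027, 3379],
})

ps = PorterStemmer()
df["Stem"] = df["Word"].apply(ps.stem)

# For each stem: label the group with its shortest original word
# (an arbitrary choice for this sketch) and sum the frequencies.
res = df.groupby("Stem", as_index=False).agg(
    Word=("Word", lambda s: min(s, key=len)),
    Frequency=("Frequency", "sum"),
)

print(res[["Word", "Frequency"]])
```

With this subset, the groups come out labelled character, end and kill, with the same sums as in the output above (16926, 8406 and 12279).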