I have a question about how to sum the frequencies of words that I consider to have similar meaning, so that they are counted as a single word.
For example, I have this dataset:
Word Frequency
0 game 52055
1 laura 24953
2 luke 21133
3 story 20739
4 dog 17054
5 like 12792
7 character 8845
9 play 8420
11 characters 8081
12 people 7933
16 good 6496
18 10 6309
19 gameplay 6195
22 revenge 5922
25 bad 5331
26 end 5027
27 feel 4833
28 killed 4779
31 kill 4545
33 graphics 4372
34 time 4272
35 cat 4244
44 great 3466
45 ending 3379
...
50 love 3059
51 never 2965
52 new 2963
53 killing 2955
This is a dataset with two columns: one with the words and another with their frequency throughout the document. I need to treat words that share the same root as a single word. I think this should be easy to do with NLTK's PorterStemmer; however, I would also need to sum their frequencies.
So, for example,
28 killed 4779
31 kill 4545
53 killing 2955
should be
31 kill 12279
Unfortunately I could not apply stemming earlier, as the dataset I received already looks as shown above. Could you please give me some advice on how to get this sum?
You can use NLTK (df being the input dataframe you've shared):
from nltk.stem import PorterStemmer

ps = PorterStemmer()
# Stem each word, then sum the frequencies of all words sharing a stem.
df["Stem"] = df["Word"].apply(ps.stem)
res = df.groupby("Stem")["Frequency"].sum()
Output (for the sample you shared):
Stem
10 6309
bad 5331
cat 4244
charact 16926
dog 17054
end 8406
feel 4833
game 52055
gameplay 6195
good 6496
graphic 4372
great 3466
kill 12279
laura 24953
like 12792
love 3059
luke 21133
never 2965
new 2963
peopl 7933
play 8420
reveng 5922
stori 20739
time 4272
Name: Frequency, dtype: int64
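Note that the index of the result is the stem itself (e.g. "charact", "stori"), which may not be what you want to display. If you'd rather label each group with a readable word, one option is to keep the most frequent original word per stem as the label. A minimal sketch of that idea, using a small hypothetical sample in the same shape as your data:

```python
import pandas as pd
from nltk.stem import PorterStemmer

# Hypothetical sample in the same two-column shape as the question's data.
df = pd.DataFrame(
    {"Word": ["kill", "killed", "killing", "game"],
     "Frequency": [4545, 4779, 2955, 52055]}
)

ps = PorterStemmer()
df["Stem"] = df["Word"].apply(ps.stem)

# Sort by frequency so that, within each stem group, the most frequent
# original word comes first; keep it as the label and sum the frequencies.
res = (
    df.sort_values("Frequency", ascending=False)
      .groupby("Stem")
      .agg(Word=("Word", "first"), Frequency=("Frequency", "sum"))
      .reset_index(drop=True)
)
print(res)
```

Here "kill", "killed", and "killing" collapse into a single row labelled "killed" (the most frequent variant) with the summed frequency 12279, while "game" stays as-is.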