I'm working on some NLP tasks. My inputs are French text, so only the Snowball stemmer is usable in my context. Unfortunately, it keeps giving me poor stems: it won't even remove the plural "s" or the silent "e". Here is an example:
>>> from nltk.stem import SnowballStemmer
>>> SnowballStemmer("french").stem("pommes, noisettes dorées & moelleuses, la boîte de 350g")
'pommes, noisettes dorées & moelleuses, la boîte de 350g'
Stemmers stem words, not sentences, so tokenize the sentence first and then stem each token individually:
>>> from nltk import word_tokenize
>>> from nltk.stem import SnowballStemmer
>>> fr = SnowballStemmer('french')
>>> sent = "pommes, noisettes dorées & moelleuses, la boîte de 350g"
>>> word_tokenize(sent)
['pommes', ',', 'noisettes', 'dorées', '&', 'moelleuses', ',', 'la', 'boîte', 'de', '350g']
>>> [fr.stem(word) for word in word_tokenize(sent)]
['pomm', ',', 'noiset', 'dor', '&', 'moelleux', ',', 'la', 'boît', 'de', '350g']
>>> ' '.join([fr.stem(word) for word in word_tokenize(sent)])
'pomm , noiset dor & moelleux , la boît de 350g'
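For convenience, you can wrap this in a small helper. A minimal sketch, assuming NLTK's Punkt tokenizer data is installed (nltk.download('punkt'), or 'punkt_tab' on newer NLTK versions); the function name stem_french is just illustrative. Note that word_tokenize also accepts a language argument, which selects the French Punkt model instead of the default English one:

from nltk import word_tokenize
from nltk.stem import SnowballStemmer

fr = SnowballStemmer('french')

def stem_french(sentence):
    """Tokenize a French sentence and stem each token individually."""
    # language='french' makes word_tokenize use the French Punkt model
    tokens = word_tokenize(sentence, language='french')
    return ' '.join(fr.stem(token) for token in tokens)

print(stem_french("pommes, noisettes dorées & moelleuses, la boîte de 350g"))
# pomm , noiset dor & moelleux , la boît de 350g

If the punctuation tokens get in the way downstream, you can filter them out (e.g. keep only tokens where token.isalnum()) before stemming.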