r nlp stemming

Italian Stemmer alternative to Snowball


I'm trying to analyze Italian texts in R. As is usual in text analysis, I have removed all punctuation, special characters, and Italian stopwords. But I have a problem with stemming: the only Italian stemmer I have found is Snowball, and it is not very precise.

For the stemming I used the stemDocument function from the tm library, and I also tried the SnowballC library directly. Both give the same result.

  library(tm)
  stemDocument(content(myCorpus[[1]]), language = "italian")

The problem is that the resulting stems are not very precise. Are there other, more precise Italian stemmers? Or is there a way to extend the stemming already available in the tm library by adding new terms?


Solution

  • Another alternative you can check out is the package from this person, who provides it for many different languages. Here is the link for Italian.

    Whether it will help in your case is a separate question, but it can also be used via the corpus package. Their documentation gives a sample (for the English use case; adapt it for Italian) in the Dictionary Stemmer section.


    Alternatively, in a similar way, you can consider the stemmers or lemmatizers from Python libraries such as NLTK or spaCy (if you haven't considered lemmatizers, they are worth a look) and check whether you get better results. The dictionary-based ones are essentially files that map inflected (child) words to their root words. Download them, adjust the file to your requirements, and apply the mappings through a custom function.
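
As a rough illustration of the "mappings plus a custom function" idea, here is a minimal base R sketch. The names stem_map and dictionary_stem are hypothetical, and the four mapping entries are made up for the example; in practice the mapping would be loaded from whichever dictionary file you downloaded and adjusted.

```r
# Hypothetical mapping from inflected forms to a chosen root form.
# Illustrative entries only; a real mapping would come from a dictionary file.
stem_map <- c(
  andiamo = "andare",
  andato  = "andare",
  case    = "casa",
  libri   = "libro"
)

# Look up each token (case-insensitively) in the mapping.
# Tokens without an entry are handed to `fallback`, which by default
# returns them unchanged.
dictionary_stem <- function(tokens, map, fallback = identity) {
  key <- tolower(tokens)
  hit <- key %in% names(map)
  out <- character(length(tokens))
  out[hit] <- unname(map[key[hit]])
  out[!hit] <- fallback(tokens[!hit])
  out
}

tokens <- c("Andiamo", "case", "libri", "gatto")
dictionary_stem(tokens, stem_map)
```

Words missing from the mapping could instead be sent to the Snowball stemmer by passing something like `fallback = function(w) SnowballC::wordStem(w, language = "italian")`, assuming the SnowballC package is installed. That way the dictionary only has to cover the terms Snowball handles badly.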