r nlp tm

How can I get a list of the relationship between roots and words after stemming a document in R?

I'm working on a text mining project in which we want to categorize a variable by sport (it is a free-text variable that describes sports). For this reason I want to stem it. I want to check whether the relation between roots and words is correct, so I need to know which roots include which words. I'm working in R; could someone help me, please?

After removing punctuation, numbers, and extra whitespace, I do:

library(tm)
myData <- c('natacion gimnasio','gimnasia montana','correr bicicleta','corremontanismo','nadar bici')
corpus <- Corpus(VectorSource(myData))
dictCorpus <- corpus
corpus <- tm_map(corpus, stemDocument, language = "spanish")
inspect(corpus[1:5])
corpus <- tm_map(corpus, stemCompletion, dictionary=dictCorpus)
inspect(corpus[1:5])

Then I have:

I have 3 problems that I don't know how to solve:

1) Get a list with the relationship between roots and words (for example: root = gimnasi; words = gimnasio, gimnasia | root = montan; words = montana, montanismo). I want to see each root together with its associated words.
2) Make the correct match (bicicleta == bici, but stemDocument doesn't connect them).
3) Replace the root with a full word when stemCompletion is applied.

Thanks in advance.

Solution

I don't have complete answers to all your questions, but I will try to answer as much as I can.

1) You can go to the Snowball website for the Spanish stemming algorithm.

Spanish sample list is here

Corresponding root is here

Matching these two files will give you the relation between the roots and the words.
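If you would rather build the same kind of list for your own vocabulary, a minimal sketch could look like the one below. It assumes the SnowballC package is installed (tm's stemDocument relies on it for stemming) and calls its wordStem function directly. The names words, stems and rootToWords are only illustrative.

```r
library(SnowballC)

myData <- c('natacion gimnasio', 'gimnasia montana', 'correr bicicleta',
            'corremontanismo', 'nadar bici')

# Split each document on whitespace and keep the unique vocabulary
words <- unique(unlist(strsplit(myData, "\\s+")))

# Stem every word with the Spanish Snowball stemmer
stems <- wordStem(words, language = "spanish")

# Group the original words by stem: one list element per root,
# holding all words that were reduced to that root
rootToWords <- split(words, stems)
print(rootToWords)
```

Each element of rootToWords is named after a root and contains the words that share it, so you can inspect the list to see whether the groupings make sense for your sports.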

2) Getting a correct match between bici and bicicleta is difficult. They do not have the same lemma or root. You would need a synonym dictionary to help you with this.
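One way to apply such a dictionary is to replace the variants before stemming. The sketch below uses only base R. The synonyms table and the normalizeWords helper are hypothetical and written for illustration; you would have to fill the table by hand for your own data.

```r
# Hypothetical synonym table: names are the variants found in the text,
# values are the canonical words they should be replaced with
synonyms <- c(bici = "bicicleta")

# Replace every token that appears in the table by its canonical word
normalizeWords <- function(text, dict) {
  tokens <- strsplit(text, "\\s+")[[1]]
  hit <- tokens %in% names(dict)
  tokens[hit] <- unname(dict[tokens[hit]])
  paste(tokens, collapse = " ")
}

myData <- c('natacion gimnasio', 'gimnasia montana', 'correr bicicleta',
            'corremontanismo', 'nadar bici')

# Apply the replacement to every document before building the corpus
normalized <- vapply(myData, normalizeWords, character(1),
                     dict = synonyms, USE.NAMES = FALSE)
print(normalized)
```

After this step bici and bicicleta are the same token, so the stemmer will give them the same root.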

3) Returning the word instead of the root is interesting, but Spanish has masculine and feminine forms. If I look at the lemmas for gimnasio / gimnasia, they are gimnasio and gimnasia, even though the root is gimnasi. Which word do you want to return? You might want to decide on this before you start stemming, and create a dictionary that contains only the one form (masculine or feminine) you want returned for each root.
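Once you have decided which word each root should map to, the replacement itself can be a simple lookup. The sketch below uses only base R. The canonical table and the completeStem helper are hypothetical; the roots gimnasi and montan are the ones from the question, and nad is a made-up root with no entry, included to show that unknown roots are left unchanged.

```r
# Hypothetical dictionary: one chosen word per root
canonical <- c(gimnasi = "gimnasio", montan = "montana")

# Replace each root by its chosen word; roots without an entry stay as they are
completeStem <- function(stems, dict) {
  out <- unname(dict[stems])
  missing <- is.na(out)
  out[missing] <- stems[missing]
  out
}

completed <- completeStem(c("gimnasi", "montan", "nad"), canonical)
print(completed)
```

Compared with letting stemCompletion pick a completion for you, an explicit table keeps the choice between forms such as gimnasio and gimnasia in your hands.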