I'm working on a text-mining project in which we want to categorize a free-text variable that describes sports. To do this I want to stem it. I also want to check that the relation between roots and words is correct, so I need to know which words each root covers. I'm working in R; could someone help me, please?
After removing punctuation, numbers, and extra whitespace, I'm doing:
library(tm)
myData <- c('natacion gimnasio','gimnasia montana','correr bicicleta','corremontanismo','nadar bici')
corpus <- Corpus(VectorSource(myData))
dictCorpus <- corpus
corpus <- tm_map(corpus, stemDocument, language = "spanish")
inspect(corpus[1:5])
corpus <- tm_map(corpus, stemCompletion, dictionary=dictCorpus)
inspect(corpus[1:5])
Then I have:
I have three problems that I don't know how to solve:
Thanks in advance.
I don't have complete answers to all your questions, but I will try to answer as much as I can.
1) You can go to the Snowball website for the Spanish stemming algorithm.
Spanish sample list is here
Corresponding root is here
Matching these two files line by line will give you the relation between each root and its words.
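As a rough illustration of point 1, the matching can be sketched in base R. This assumes the two Snowball files are parallel (line i of the root file is the stem of line i of the word file); the function name `words_by_stem` and the file names in the comments are hypothetical, and only the gimnasio / gimnasia pair from point 3 is used as data.

```r
# Group words under the stem they reduce to.
# `words` and `stems` are parallel vectors: stems[i] is the stem of words[i].
words_by_stem <- function(words, stems) {
  stopifnot(length(words) == length(stems))
  keep <- !duplicated(words)          # list each word only once
  split(words[keep], stems[keep])     # named list: stem -> its words
}

# In practice the vectors would be read from the two downloaded files, e.g.
#   words <- readLines("voc.txt")      # hypothetical local file name
#   stems <- readLines("output.txt")   # hypothetical local file name
words <- c("gimnasio", "gimnasia")
stems <- c("gimnasi", "gimnasi")

stem_map <- words_by_stem(words, stems)
print(stem_map)
```

For your own vocabulary, the stems could instead be computed directly, for example with `SnowballC::wordStem(words, language = "spanish")`, and passed to the same function.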
2) Getting a correct match between bici and bicicleta is difficult. They do not have the same lemma or root. You would need a synonym dictionary to help you with this.
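For point 2, one possible way to apply such a synonym dictionary is to replace known short forms before stemming. This is a minimal sketch: the `synonyms` table and the `normalize_tokens` helper are made up for illustration, and a real table would have to be built by hand or taken from a lexical resource.

```r
# Hypothetical synonym table: names are variants, values are canonical forms.
synonyms <- c(bici = "bicicleta")

# Replace every token that appears in the table by its canonical form.
normalize_tokens <- function(text, synonyms) {
  tokens <- strsplit(text, "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]            # drop empty tokens
  hit <- tokens %in% names(synonyms)
  tokens[hit] <- unname(synonyms[tokens[hit]])
  paste(tokens, collapse = " ")
}

myData <- c('natacion gimnasio', 'gimnasia montana', 'correr bicicleta',
            'corremontanismo', 'nadar bici')

normalized <- vapply(myData, normalize_tokens, character(1),
                     synonyms = synonyms, USE.NAMES = FALSE)
print(normalized)
```

The normalized vector can then be passed to `VectorSource()` in place of `myData`, so that both spellings reach the stemmer as the same word.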
3) Returning the word instead of the root is interesting, but Spanish has masculine and feminine forms. If I look at the lemmas for gimnasio / gimnasia, they are gimnasio and gimnasia, even though the root of both is gimnasi. Which word do you want to return? You might want to decide this before you start stemming, and create a dictionary that contains only the masculine or only the feminine form.
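For point 3, once you have decided which form to keep, the completion step could be sketched as a lookup from root to preferred word. The `canonical` table and the `complete_stems` helper are assumptions for illustration; choosing gimnasio over gimnasia is arbitrary here, and "corr" only stands in for a root that has no entry in the table.

```r
# Hypothetical lookup: names are roots, values are the word to return.
canonical <- c(gimnasi = "gimnasio")

# Replace each root by its preferred word; roots without an entry are kept.
complete_stems <- function(stems, canonical) {
  out <- unname(canonical[stems])   # NA where the root is not in the table
  ifelse(is.na(out), stems, out)
}

stems <- c("gimnasi", "gimnasi", "corr")
completed <- complete_stems(stems, canonical)
print(completed)
```

A fixed table like this makes the result deterministic, at the cost of having to maintain one entry per root you care about.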