I am new to the tm package in R. I am trying to perform a word frequency analysis, but I know there are several spelling issues in my source file, and I was wondering how I can fix these spelling errors before running the word frequency analysis.
I have already read another post (Stemming with R Text Analysis), but I have a question about the solution proposed there: is it possible to use a dictionary (a data frame, for example) to make several/all of the replacements in my corpus before creating the TermDocumentMatrix and running the word frequency analysis?
I have a data frame with the dictionary, and it has the following structure:
sept -> september
sep -> september
acct -> account
serv -> service
servic -> service
adj -> adjustment
ajuste -> adjustment
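Built as a data frame it would look something like this (the column names are just the ones I use here):

    dictionary <- data.frame(
      misspelled = c("sept", "sep", "acct", "serv", "servic", "adj", "ajuste"),
      correct    = c("september", "september", "account", "service",
                     "service", "adjustment", "adjustment"),
      stringsAsFactors = FALSE
    )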
I know I could write a function to perform transformations on my corpus, but I really do not know how to automate this task and loop over each record in my data frame.
Any help would be greatly appreciated.
For the basic automatic construction of a stemmer from a standard English dictionary, Tyler Rinker's answer already shows what you want.
All you need to add is code for synthesizing likely misspellings, or for matching (common) misspellings in your corpus using a word-distance metric such as Levenshtein distance (see adist) to find the closest match in the dictionary.
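A minimal sketch of that second idea, assuming the dictionary is a plain character vector of correct words and the corpus is built with tm (the helper names, the toy documents, and the max_dist cutoff are placeholders, not a tested solution):

    library(tm)

    # Reference vocabulary to snap near-misses onto (placeholder list).
    dictionary <- c("september", "account", "service", "adjustment")

    # Replace a single word by its closest dictionary entry, measured with
    # Levenshtein distance via adist(); leave it unchanged if nothing is close.
    correct_word <- function(word, dict, max_dist = 3) {
      d <- adist(word, dict)
      best <- which.min(d)
      if (d[best] <= max_dist) dict[best] else word
    }

    # Apply the correction word by word to a whole document.
    correct_text <- function(x, dict) {
      words <- unlist(strsplit(x, "\\s+"))
      paste(vapply(words, correct_word, character(1), dict = dict), collapse = " ")
    }

    corpus <- Corpus(VectorSource(c("servic request", "acct adjustmnt")))
    corpus <- tm_map(corpus, content_transformer(function(x) correct_text(x, dictionary)))

    tdm <- TermDocumentMatrix(corpus)
    inspect(tdm)  # "servic", "acct" and "adjustmnt" now count towards the corrected terms

Note that this calls adist() against the whole dictionary for every token, so on a large corpus you would probably want to compute the correction once per unique token and cache it.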