
Misspelling-aware stemming with R Text Analysis


I am new to the tm package in R. I am trying to perform a word frequency analysis, but I know that my source file contains several spelling errors, and I would like to know how to fix them before computing the word frequencies.

I have already read another post (Stemming with R Text Analysis), but I have a question about the solution proposed there: is it possible to use a dictionary (a data frame, for example) to make several or all of the replacements in my corpus before creating the TermDocumentMatrix and running the word frequency analysis?

I have a data frame containing the dictionary, with the following structure:

sept   -> september  
sep    -> september  
acct   -> account  
serv   -> service  
servic -> service  
adj    -> adjustment  
ajuste -> adjustment  

I know I could write a function to transform my corpus, but I do not know how to automate the task, for example by looping over each record in my data frame.

Any help would be greatly appreciated.


Solution

  • For the basic automatic construction of a stemmer from a standard English dictionary, Tyler Rinker's answer already shows what you want.

    All you need to add is code that either synthesizes likely misspellings or matches the (common) misspellings in your corpus to their closest dictionary entry, using a word-distance metric such as Levenshtein distance (see adist).
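As a rough illustration of both ideas, here is a minimal base-R sketch. The data frame `dict` and its column names `from` and `to`, the helper names `replace_from_dict` and `closest_match`, the sample `docs`, and the `max_dist` cutoff of 2 are all assumptions made for this example; they do not come from the question or from the linked answer. The first helper loops over the lookup table and replaces whole words only; the second uses `adist` to map each word to its nearest vocabulary entry, leaving the word unchanged when nothing is close enough.

```r
# Hypothetical lookup table mirroring the structure shown in the question.
dict <- data.frame(
  from = c("sept", "sep", "acct", "serv", "servic", "adj", "ajuste"),
  to   = c("september", "september", "account", "service", "service",
           "adjustment", "adjustment"),
  stringsAsFactors = FALSE
)

# Replace every whole-word occurrence of dict$from[i] with dict$to[i].
# The \\b anchors stop "sep" from matching inside "september".
replace_from_dict <- function(text, dict) {
  for (i in seq_len(nrow(dict))) {
    pattern <- paste0("\\b", dict$from[i], "\\b")
    text <- gsub(pattern, dict$to[i], text, perl = TRUE)
  }
  text
}

# Map each word to its nearest vocabulary entry by Levenshtein distance.
# Words farther than max_dist (an assumed cutoff) from every entry are kept.
closest_match <- function(words, vocabulary, max_dist = 2) {
  d <- adist(words, vocabulary)            # rows: words, columns: vocabulary
  best <- apply(d, 1, which.min)           # index of the nearest entry per word
  best_dist <- d[cbind(seq_along(words), best)]
  ifelse(best_dist <= max_dist, vocabulary[best], words)
}

docs <- c("acct adj posted in sept", "serv fee and ajuste")
cleaned <- replace_from_dict(docs, dict)
print(cleaned)
# "account adjustment posted in september" "service fee and adjustment"

print(closest_match(c("septembr", "acount", "xyzzy"),
                    c("september", "account", "service")))
# "september" "account" "xyzzy"
```

With the tm package loaded, the same helper could presumably be applied to a corpus before building the TermDocumentMatrix, along the lines of `tm_map(corpus, content_transformer(replace_from_dict), dict = dict)`, where `corpus` stands for your own corpus object. The distance cutoff is worth tuning on your data, since short abbreviations such as "adj" are far from their full forms in edit distance and are better handled by the explicit lookup table.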