Tags: r, gsub, tm, mapply

Use mapply to replace patterns in a vector with replacements from another vector in tm


Hi there: I'm using the tm package for some text analysis, and I need to substitute each term in a vector of patterns with its paired term in a vector of replacements. The pattern/replacement dictionary looks like this:

# Pattern/replacement dictionary
df <- data.frame(replace = c('crude', 'oil', 'price'),
                 with = c('xcrude', 'xoil', 'xprice'))
# Load tm
library(tm)
# Load the crude example corpus
data('crude')

I tried this, but it failed with the following warning:

tm_map(crude, mapply, gsub, df$replace, df$with)

Warning message:
In mclapply(content(x), FUN, ...) :
all scheduled cores encountered errors in user code

Solution

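  • tm_map() passes each document as the first argument to the function it is given (see the mclapply(content(x), FUN, ...) call in the warning), so here mapply() receives the document where it expects its FUN argument. Based on this answer, you could instead use stri_replace_all_fixed() from stringi and wrap it in content_transformer() to preserve the corpus structure: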

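    library(stringi)

    # vectorize_all = FALSE applies every pattern/replacement pair to each document in turn
    corp <- tm_map(crude, content_transformer(
      function(x) {
        stri_replace_all_fixed(x, df$replace, df$with, vectorize_all = FALSE)
      })
    )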
    

    Or use multigsub() from qdap:

    library(qdap)

    corp <- tm_map(crude, content_transformer(
      function(x) {
        multigsub(df$replace, df$with, x, fixed = FALSE)
      })
    )
    

    Which gives:

    > corp[[1]][1]
    

    "Diamond Shamrock Corp said that\neffective today it had cut its contract xprices for xcrude xoil by\n1.50 dlrs a barrel.\n The reduction brings its posted xprice for West Texas\nIntermediate to 16.00 dlrs a barrel, the copany said.\n \"The xprice reduction today was made in the light of falling\nxoil product xprices and a weak xcrude xoil market,\" a company\nspokeswoman said.\n
    Diamond is the latest in a line of U.S. xoil companies that\nhave cut its contract, or posted, xprices over the last two days\nciting weak xoil markets.\n Reuter"

    You can then apply other tm functions to the resulting corpus:

    > DocumentTermMatrix(corp)
    #<<DocumentTermMatrix (documents: 20, terms: 1269)>>
    #Non-/sparse entries: 2262/23118
    #Sparsity           : 91%
    #Maximal term length: 17
    #Weighting          : term frequency (tf)
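
If you would rather avoid an extra package, the same pairwise replacement can be sketched in base R. This is only an illustration, not part of the original answer: replace_all_pairs is a hypothetical helper name, and the dictionary is assumed to be held in two plain character vectors. It also shows why a bare mapply(gsub, ...) is not enough even when called correctly: mapply() returns one result per pair, whereas Reduce() feeds the output of each substitution into the next.

```r
# Pattern/replacement pairs from the question, as plain character vectors.
patterns     <- c("crude", "oil", "price")
replacements <- c("xcrude", "xoil", "xprice")

# Hypothetical helper (not part of tm or stringi): apply each pair in turn,
# passing the result of one substitution on to the next.
replace_all_pairs <- function(x, patterns, replacements, fixed = TRUE) {
  Reduce(
    function(text, i) gsub(patterns[i], replacements[i], text, fixed = fixed),
    seq_along(patterns),
    x  # initial value: the unmodified text
  )
}

txt <- "crude oil price"

# mapply() yields one string per pair, each with only a single substitution.
per_pair <- unname(mapply(gsub, patterns, replacements, MoreArgs = list(x = txt)))
print(per_pair)
# [1] "xcrude oil price" "crude xoil price" "crude oil xprice"

# Reduce() threads the text through all pairs.
print(replace_all_pairs(txt, patterns, replacements))
# [1] "xcrude xoil xprice"
```

Under the same assumptions, the helper should plug into tm in the usual way, for example tm_map(crude, content_transformer(replace_all_pairs), patterns, replacements). Because the pairs are applied sequentially, a replacement that happens to contain a later pattern would be rewritten again, so order the pairs with that in mind.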