r dataframe tm corpus

R tm package Upgrade - Error in converting corpus to data frame


Something seems to have gone wrong in the latest tm upgrade. My code is below, with test data:

data = c('Lorem ipsum dolor sit amet, consectetur adipiscing elit',
           'Vestibulum posuere nisl vel lobortis vulputate',
           'Quisque eget sem in felis egestas sagittis')
ccorpus_clean = Corpus(VectorSource((data)))
ccorpus_clean = tm_map(ccorpus_clean,removePunctuation,lazy=TRUE)
ccorpus_clean = tm_map(ccorpus_clean,stripWhitespace,lazy=TRUE)
ccorpus_clean = tm_map(ccorpus_clean,tolower,lazy=TRUE)
ccorpus_clean = tm_map(ccorpus_clean,removeNumbers,lazy=TRUE)
ccorpus_clean = tm_map(ccorpus_clean,stemDocument,lazy=TRUE)
ccorpus_clean = tm_map(ccorpus_clean,removeWords,stopwords("english"),lazy=TRUE)
ccorpus_clean = tm_map(ccorpus_clean,removeWords,c("hi"),lazy=TRUE)
ccorpus_clean = tm_map(ccorpus_clean,removeWords,c("account","can"),lazy=TRUE)     
ccorpus_clean = tm_map(ccorpus_clean,PlainTextDocument,lazy=TRUE)
ccorpus_clean = tm_map(ccorpus_clean,stripWhitespace,lazy=TRUE);
ccorpus_clean;
df = data.frame(text=unlist(sapply(ccorpus_clean , `[[`, "content")), stringsAsFactors=FALSE)

Everything was working fine earlier, but suddenly I needed to add ", lazy=TRUE" to every tm_map call. Without it, the corpus transformations stopped working. The lazy problem is documented in the question "R tm In mclapply(content(x), FUN, ...) : all scheduled cores encountered errors in user code".

With lazy=TRUE, the transformations work, but converting the corpus back to a data frame now fails with the error below:

ccorpus_clean = tm_map(ccorpus_clean,stripWhitespace,lazy=TRUE)
ccorpus_clean

<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 5

df = data.frame(text=unlist(sapply(ccorpus_clean , `[[`, "content")), stringsAsFactors=FALSE)

Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "try-error"
In addition: Warning message:
In mclapply(x$content[i], function(d) tm_reduce(d, x$lazy$maps)) :
all scheduled cores encountered errors in user code

Edit: this fails too:

data.frame(text = sapply(ccorpus_clean, as.character), stringsAsFactors = FALSE)

Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "try-error"

R version: 3.2.3 (2015-12-10); tm version: 0.6-2


Solution

  • Looks very complicated. How about:

    data <- c("Lorem ipsum dolor sit amet account: 999 red balloons.",
              "Some English words are just made for stemming!")
    
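    library(quanteda)
    
    # makes the texts into a list of tokens with the same treatment
    # as your tm mapped functions
    # (note: tokenize(), toLower(), removeFeatures() and wordstem() belong to the
    # quanteda release current at the time; later releases renamed them to
    # tokens(), char_tolower()/tokens_tolower(), tokens_remove() and tokens_wordstem())
    toks <- tokenize(toLower(data), removePunct = TRUE, removeNumbers = TRUE)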
    # toks is just a named list
    toks
    ## tokenizedText object from 2 documents.
    ## Component 1 :
    ## [1] "lorem"    "ipsum"    "dolor"    "sit"      "amet"     "account"  "red"      "balloons"
    ## 
    ## Component 2 :
    ## [1] "some"     "english"  "words"    "are"      "just"     "made"     "for"      "stemming"
    
    # remove selected terms
    toks <- removeFeatures(toks, c(stopwords("english"), "hi", "account", "can"))
    
    # apply stemming
    toks <- wordstem(toks)
    
    # make into a data frame by reassembling the cleaned tokens
    (df <- data.frame(text = sapply(toks, paste, collapse = " ")))
    ##                                     text
    ## 1 lorem ipsum dolor sit amet red balloon
    ## 2            english word just made stem
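If you would rather stay with tm, a commonly reported cause of the "try-error" / mclapply failure in tm 0.6 is passing a base function such as tolower directly to tm_map: it returns a plain character vector instead of a text document, so later steps that call meta() on it fail. The sketch below assumes that is what happens here. It wraps tolower in content_transformer(), drops lazy=TRUE and the PlainTextDocument step, and leaves out stemDocument only to avoid the extra SnowballC dependency. The names corpus and df are illustrative, not from the original post.

```r
library(tm)

data <- c("Lorem ipsum dolor sit amet, consectetur adipiscing elit",
          "Vestibulum posuere nisl vel lobortis vulputate",
          "Quisque eget sem in felis egestas sagittis")

corpus <- VCorpus(VectorSource(data))

# Base functions like tolower() return a bare character vector, so wrap
# them in content_transformer() to keep each element a text document.
corpus <- tm_map(corpus, content_transformer(tolower))

# tm's own transformations already preserve the document class.
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords,
                 c(stopwords("english"), "hi", "account", "can"))
corpus <- tm_map(corpus, stripWhitespace)

# Pull the text back out: one row per document.
df <- data.frame(
  text = vapply(corpus,
                function(doc) paste(content(doc), collapse = " "),
                character(1)),
  stringsAsFactors = FALSE
)
```

If the wrapped version still fails on your installation, the cause is probably something other than the tolower step, and the error from a non-lazy tm_map call should point at the transformation responsible.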