r tm

Unlisting a corpus from the tm package gives NAs

I have a corpus created with the tm package. I've applied all my transformations and am ready to convert it back to a data frame.

When I use

twit[[1]]$content

I can see my data. However, when I try to unlist it, I get NA for every record.

twitCln <- data.frame(text=unlist(sapply(twit, '[', "content")), stringsAsFactors=F)

The linked question, "Loop through a tm corpus without losing corpus structure", has a discussion under its only answer that describes the same issue, but there does not appear to be a resolution.

Here is some reproducible code.

library(tm)
bbTwit <- as.data.frame(c("Text Line One!", "Text Line 2"), stringsAsFactors = F)
colnames(bbTwit) <- 'Contents'
bbTwit$doc_id <- row.names(bbTwit) 
twit <- bbTwit[c('doc_id','Contents')]
colnames(twit) <- c('doc_id','text')

twit <- Corpus(DataframeSource(data.frame(twit)))
twit <- tm_map(twit, removePunctuation)
twit <- tm_map(twit, stripWhitespace)

twit[[1]]$content

twitCln <- data.frame(text=unlist(sapply(twit, '[', "content")), stringsAsFactors=F)

The expected output is a data frame with 2 observations, where "Text Line One" is the first record and "Text Line 2" is the second. What I get instead is two observations of NA.

Solution

To get the content out, just use the content() function. For example:

content(twit)
# [1] "Text Line One" "Text Line 2"

Or put it in a data frame:

data.frame(text=content(twit))
#            text
# 1 Text Line One
# 2   Text Line 2
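
As for why the original call returned NA, here is a minimal base R sketch of one likely cause. It assumes that iterating over this corpus with sapply() hands each document over as a plain character string rather than as an object with a "content" field; that is an assumption about how tm behaves here, not something confirmed above. The names docs, by_name and as_is are made up for the illustration, and docs is only a stand-in for the corpus.

```r
# Stand-in for what sapply() is assumed to iterate over: bare strings,
# not documents carrying a "content" field.
docs <- list("Text Line One", "Text Line 2")

# `[` with "content" looks for an element named "content". An unnamed
# character vector has no such element, so every lookup comes back as NA.
by_name <- unlist(sapply(docs, `[`, "content"))
print(by_name)

# Taking each string as it is recovers the text.
as_is <- unlist(docs)
print(as_is)
```

If that assumption holds, it would explain why twit[[1]]$content works (the `[[` accessor builds a document object) while the sapply() version does not, and why content() is the simpler route.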