I have a corpus that I created with the tm package. I have applied all my transformations and am ready to convert it back to a data frame.
When I use
twit[[1]]$content
I can see my data. However, when I try to unlist it, I get NAs for all my records.
twitCln <- data.frame(text=unlist(sapply(twit, '[', "content")), stringsAsFactors=F)
The linked question, "Loop through a tm corpus without losing corpus structure", has a discussion after its only answer that describes the same issue, but there does not appear to be a resolution.
Here is some reproducible code.
library(tm)
bbTwit <- as.data.frame(c("Text Line One!", "Text Line 2"), stringsAsFactors = F)
colnames(bbTwit) <- 'Contents'
bbTwit$doc_id <- row.names(bbTwit)
twit <- bbTwit[c('doc_id','Contents')]
colnames(twit) <- c('doc_id','text')
twit <- Corpus(DataframeSource(data.frame(twit)))
twit <- tm_map(twit, removePunctuation)
twit <- tm_map(twit, stripWhitespace)
twit[[1]]$content
twitCln <- data.frame(text=unlist(sapply(twit, '[', "content")), stringsAsFactors=F)
The expected output is a data frame with 2 observations, where "Text Line One" is the first record and "Text Line 2" is the second. What I get instead is two observations of NA.
To get the content out, just use the content() function. For example:
content(twit)
# [1] "Text Line One" "Text Line 2"
Or put it in a data.frame:
data.frame(text=content(twit))
#            text
# 1 Text Line One
# 2   Text Line 2
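For completeness, here is a minimal end-to-end sketch that rebuilds the corpus from the question and stores the result in twitCln with a character column, as the original attempt intended. It assumes a tm version where Corpus() on a DataframeSource yields a corpus whose content() is a plain character vector, as the output above suggests. The unname() call is my own addition, there only to drop any document-id names so the data frame gets default row names.

```r
library(tm)

# Rebuild the corpus from the question (same steps, condensed).
twit <- data.frame(doc_id = c("1", "2"),
                   text = c("Text Line One!", "Text Line 2"),
                   stringsAsFactors = FALSE)
twit <- Corpus(DataframeSource(twit))
twit <- tm_map(twit, removePunctuation)
twit <- tm_map(twit, stripWhitespace)

# content() returns the cleaned text for every document at once.
# Assumption: it is a character vector here; unname() drops any names on it.
twitCln <- data.frame(text = unname(content(twit)), stringsAsFactors = FALSE)
str(twitCln)
```

Passing stringsAsFactors = FALSE keeps the text column as character rather than factor on R versions before 4.0, where factor conversion was the default.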