Until about a month ago, the code shown below let me import a series of .txt documents stored in a local folder into R, create a corpus, pre-process it, and finally convert it into a document-term matrix. The issue I am having now is that the document names are no longer imported; instead, each document is listed as 'character(0)'.
One of my aims is to conduct topic modelling on the corpus and so it is important that I can relate the document names to the topics that the model produces.
Does anyone have any suggestions as to what has changed? Or how I can fix this?
library("tm")
library("SnowballC")
setwd("C:/Users/Documents/Dataset/")
corpus <- Corpus(DirSource("blog"))
#pre_processing
myStopwords <- c(stopwords("english"))
your_corpus <- tm_map(corpus, tolower)
your_corpus <- tm_map(your_corpus, removeNumbers)
your_corpus <- tm_map(your_corpus, removeWords, myStopwords)
your_corpus <- tm_map(your_corpus, stripWhitespace)
your_corpus <- tm_map(your_corpus, removePunctuation)
your_corpus <- tm_map(your_corpus, stemDocument)
your_corpus <- tm_map(your_corpus, PlainTextDocument)
#creating a document term matrix
myDtm <- DocumentTermMatrix(your_corpus, control=list(wordLengths=c(3,Inf)))
dim(myDtm)
inspect(myDtm)
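Here's a debugging session that identifies and corrects the loss of the file names. It runs on the sample texts shipped with tm rather than your blog folder, and calls meta() after each step so you can see where the metadata disappears. The tolower line is now wrapped in content_transformer(), and the PlainTextDocument line is commented out, since both of those lines strip the document metadata, which includes the file name. Also, if you check ds$reader, you can see that the baseline reader already creates a plain text document, so the PlainTextDocument step is not needed.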
library("tm")
library("SnowballC")
# corpus <- Corpus(DirSource("blog"))
# Use the sample texts shipped with tm in place of the "blog" folder
sf <- system.file("texts", "txt", package = "tm")
ds <- DirSource(sf)
your_corpus <- Corpus(ds)
# Check status with the following line
meta(your_corpus[[1]])
#pre_processing
myStopwords <- c(stopwords("english"))
# your_corpus <- tm_map(your_corpus, tolower)
your_corpus <- tm_map(your_corpus, content_transformer(tolower))
meta(your_corpus[[1]])
your_corpus <- tm_map(your_corpus, removeNumbers)
meta(your_corpus[[1]])
your_corpus <- tm_map(your_corpus, removeWords, myStopwords)
meta(your_corpus[[1]])
your_corpus <- tm_map(your_corpus, stripWhitespace)
meta(your_corpus[[1]])
your_corpus <- tm_map(your_corpus, removePunctuation)
meta(your_corpus[[1]])
your_corpus <- tm_map(your_corpus, stemDocument)
meta(your_corpus[[1]])
#your_corpus <- tm_map(your_corpus, PlainTextDocument)
#meta(your_corpus[[1]])
#creating a document term matrix
myDtm <- DocumentTermMatrix(your_corpus, control=list(wordLengths=c(3,Inf)))
dim(myDtm)
inspect(myDtm)
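To confirm that the file names actually reach the document-term matrix, a small self-contained check along these lines may help. It is only a sketch: the folder name blog_demo, the file names alpha.txt and beta.txt, and their text are made up for illustration, and it uses VCorpus explicitly so that each file is held as its own plain text document.

```r
library(tm)

# Write two small files to a temporary folder (hypothetical names and text).
dir_path <- file.path(tempdir(), "blog_demo")
dir.create(dir_path, showWarnings = FALSE)
writeLines("The Quick brown fox jumps.", file.path(dir_path, "alpha.txt"))
writeLines("A lazy dog sleeps all day.", file.path(dir_path, "beta.txt"))

# Build the corpus; the reader stores each file name as the document id.
corp <- VCorpus(DirSource(dir_path))

# Wrap base functions such as tolower in content_transformer() so that
# only the text is changed and the metadata is carried along.
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)

# The id of the first document should still be its file name.
first_id <- meta(corp[[1]], "id")
print(first_id)

# The document names end up as the row names of the matrix.
dtm <- DocumentTermMatrix(corp, control = list(wordLengths = c(3, Inf)))
doc_names <- Docs(dtm)
print(doc_names)
```

For topic modelling, Docs(myDtm) (or rownames(myDtm)) should then give you the names to match against the per-document topic assignments, assuming the model keeps the documents in the same order as the matrix rows.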