I'm quite new to R and currently working on a project for my studies (readability vs. performance of annual reports). I've read hundreds of posts but could not find a proper solution, so I'm stuck and need your help.
My goal is to clean roughly 1000 text documents with the tm package and export the edited texts from the VCorpus to a folder, keeping the original file names.
So far I managed to import & do (some) text mining:
### folder of txt files
dest <- "C:\\Xpdf_pdftotext\\TestCorpus"
### create a Corpus in R
docs <- VCorpus(DirSource(dest))
### do some text mining of the txt-documents
for (j in seq_along(docs)) {
  # work on $content so each element stays a PlainTextDocument;
  # assigning the gsub() result to docs[[j]] directly would replace
  # the document with a plain character vector and break the corpus
  docs[[j]]$content <- gsub("\\d", "", docs[[j]]$content)                  # remove digits
  docs[[j]]$content <- gsub("\\b[A-Za-z]{1,3}\\b", "", docs[[j]]$content)  # remove short words
  docs[[j]]$content <- gsub("\\t", "", docs[[j]]$content)                  # remove tabs
}
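Cleaning patterns like these can be sanity-checked on a plain string before touching the corpus. Note that the original `"\\b[A-z]\\b{3}"` is likely not what was intended: `[A-z]` also matches punctuation characters between `Z` and `a`, and a `{3}` quantifier on `\\b` has no effect, so the sketch below assumes the goal was to drop words of up to three letters (base R only):

```r
# quick base-R check of the cleaning patterns on a sample string
x <- "In 2019,\tthe firm reported strong results"
x <- gsub("\\d", "", x)                    # strip digits
x <- gsub("\\b[A-Za-z]{1,3}\\b", "", x)    # strip words of up to three letters
x <- gsub("\\t", "", x)                    # strip tabs
print(x)
```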
Now I want to export each file in the corpus under its original file name. Exporting works for a single file when I assign a new name manually:
writeLines(as.character(docs[[1]]), con = "text1.txt")
I've found the command for the meta ID in another post, but I don't know how to include it in my code:
docs[[1]]$meta$id
How can I efficiently export the edited textfiles from the VCorpus including their original file names?
Thanks for helping
Actually it is very simple. If you have a corpus loaded as you did, you can write the whole corpus to disk in one command using writeCorpus(). The meta tag id needs to be filled in, but in your case that already happened when you loaded the data: DirSource sets each document's id to its file name. If we take the crude dataset as an example, the ids are already included:
library(tm)
data("crude")
crude <- as.VCorpus(crude)
# bit of textcleaning
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removeWords, stopwords("english"))
#write to disk in subfolder data
writeCorpus(crude, path = "data/")
# check files
dir("data/")
[1] "127.txt" "144.txt" "191.txt" "194.txt" "211.txt" "236.txt" "237.txt" "242.txt" "246.txt" "248.txt" "273.txt" "349.txt" "352.txt"
[14] "353.txt" "368.txt" "489.txt" "502.txt" "543.txt" "704.txt" "708.txt"
The files from the crude dataset are written to disk with the ids as file names.
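One caveat for a corpus built with DirSource: each id already ends in ".txt", and writeCorpus() appends another ".txt" when building its default file names (as you can see above, the crude ids "127", "144", ... became "127.txt", "144.txt", ...). The filenames argument lets you pass the original names unchanged. A sketch, assuming docs is your corpus from the question and the output folder is just an example path:

```r
library(tm)

# collect the original file names stored as document ids by DirSource
ids <- sapply(docs, function(d) meta(d, "id"))   # e.g. "report1.txt", "report2.txt", ...

# write every document under its original name (output folder is an assumption)
writeCorpus(docs, path = "C:/Xpdf_pdftotext/Cleaned", filenames = ids)
```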