
Exporting textfiles from VCorpus including original file names in R


I'm quite new to R and currently working on a project for my studies (readability vs. performance of annual reports). I've screened hundreds of posts but could not find a proper solution, so I'm stuck and need your help.

My goal is to text-mine roughly 1000 text documents with tm and export the edited texts from the VCorpus into a folder, keeping the original file names.

So far I have managed to import the files and do (some) text mining:

### folder of txt files    

dest <- ("C:\\Xpdf_pdftotext\\TestCorpus")

### create a Corpus in R

docs <- VCorpus(DirSource(dest))

### do some text mining of the txt-documents

for (j in seq(docs)) {
  docs[[j]] <- gsub("\\d", "", docs[[j]])
  docs[[j]] <- gsub("\\b[A-z]\\b{3}", "", docs[[j]])
  docs[[j]] <- gsub("\\t", "", docs[[j]])
}
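For comparison, here is a minimal sketch of the same kind of cleaning done with tm_map and content_transformer instead of the loop. As far as I can tell, assigning the result of gsub() back into docs[[j]] may replace the document object with a plain character vector in recent tm versions, which would drop the id metadata needed for the export. The helper name remove_pattern and the two sample texts are made up for illustration; the in-memory corpus only stands in for the files in dest.

```r
library(tm)

# Small in-memory corpus standing in for the txt files in `dest` (sample texts are invented)
docs <- VCorpus(VectorSource(c("Revenue 2019:\t100 million",
                               "Costs 2019:\t80 million")))

# Hypothetical helper: delete every match of `pattern` from the document text.
# content_transformer() applies the function to the content only and keeps the metadata.
remove_pattern <- content_transformer(function(x, pattern) gsub(pattern, "", x))

docs <- tm_map(docs, remove_pattern, "\\d")  # remove digits
docs <- tm_map(docs, remove_pattern, "\\t")  # remove tabs

# The documents are still document objects, so the id is still available
class(docs[[1]])
meta(docs[[1]], "id")
as.character(docs[[1]])
```

The third pattern from the loop is left out here because it is not clear what it is meant to match.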

Next, I want to export each file in the corpus with its original file name. Exporting works for a single file when I assign a new name by hand:

writeLines(as.character(docs[1]), con="text1.txt")

I've found the command for the meta ID in a post, but I don't know how to include it in my code:

docs[[1]]$meta$id

How can I efficiently export the edited textfiles from the VCorpus including their original file names?

Thanks for helping


Solution

  • Actually it is very simple.

    If you have a corpus loaded the way you did, you can write the whole corpus to disk in one command using writeCorpus. The id metadata field needs to be filled in, but in your case that already happened when you loaded the data.

    If we take the crude dataset as an example, the ids are already included:

    library(tm)
    data("crude")
    crude <- as.VCorpus(crude)
    # bit of textcleaning
    crude <- tm_map(crude, stripWhitespace)
    crude <- tm_map(crude, removePunctuation)
    crude <- tm_map(crude, content_transformer(tolower))
    crude <- tm_map(crude, removeWords, stopwords("english"))
    
    # write to disk in subfolder data (the folder must exist first)
    dir.create("data", showWarnings = FALSE)
    writeCorpus(crude, path = "data/")
    
    # check files
    dir("data/")
    [1] "127.txt" "144.txt" "191.txt" "194.txt" "211.txt" "236.txt" "237.txt" "242.txt" "246.txt" "248.txt" "273.txt" "349.txt" "352.txt"
    [14] "353.txt" "368.txt" "489.txt" "502.txt" "543.txt" "704.txt" "708.txt"
    

    The files from the crude dataset are written to disk with the ids as filenames.
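One thing to watch for with a corpus loaded through DirSource: the ids are the original file names including the extension, and by default writeCorpus appends ".txt" to each id (at least in the tm versions I have looked at), so you could end up with names like report.txt.txt. A minimal sketch of how to avoid that by passing the ids through the filenames argument is below. The temporary folders and the two sample files are invented for the example and only stand in for your TestCorpus folder.

```r
library(tm)

# Throwaway input and output folders (hypothetical, standing in for the real ones)
in_dir  <- file.path(tempdir(), "in")
out_dir <- file.path(tempdir(), "out")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)

# Two small sample files with invented content
writeLines("Annual report 2018", file.path(in_dir, "reportA.txt"))
writeLines("Annual report 2019", file.path(in_dir, "reportB.txt"))

# Load and clean as usual; tm_map keeps the metadata intact
docs <- VCorpus(DirSource(in_dir))
docs <- tm_map(docs, removeNumbers)

# Collect the id of every document; with DirSource these are the original file names
ids <- unname(vapply(docs, function(d) meta(d, "id"), character(1)))

# Pass the ids explicitly so no extra ".txt" suffix is added
writeCorpus(docs, path = out_dir, filenames = ids)

# check files
list.files(out_dir)
```

If you prefer a different extension or a prefix, you can build the filenames vector yourself, for example with paste0(), as long as it has one entry per document.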