Create a Corpus from a List of File Paths in R


I have 1030 individual .txt files in a directory which represent all the participants in a research study.

I have successfully created a corpus for use with the tm package in R out of all the files in the directory.

Now I'm trying to create corpora from numerous subsets of these files, for example one corpus of all the female authors and one of all the male authors.

I was hoping to pass subsets of a vector of file paths to the Corpus function, but this has not worked.

Any help is appreciated. Here is an example to build from:

pathname <- c("C:/Desktop/Samples")

study.files <- list.files(path = pathname, pattern = NULL, all.files = T, full.names = T, recursive = T, ignore.case = T, include.dirs = T) 

### This gives me a character vector that is equivalent to:

study.files <- c("C:/Desktop/Samples/author1.txt","C:/Desktop/Samples/author2.txt","C:/Desktop/Samples/author3.txt","C:/Desktop/Samples/author4.txt","C:/Desktop/Samples/author5.txt")

### I define my subsets with numeric vectors

women <- c(1,3)
men <- c(2,4,5)

### This creates new character vectors containing the file paths
women.files <- study.files[women]
men.files <- study.files[men]

### Here are the things I've tried to create a corpus from the subsetted list. None of these work.

women_corpus <- Corpus(women.files)
women_corpus <- Corpus(DirSource(women.files))
women_corpus <- Corpus(DirSource(unlist(women.files)))

The subsets I need to create are rather elaborate, so I can't easily make new folders containing only the text files of interest for each corpus.


Solution

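  • I think this does what you want. DirSource expects directory paths rather than individual file paths, so the idea is to read each selected file yourself and build the corpus with VectorSource.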

    library(tm)
    
    pathname <- "C:/data/test"
    
    study.files <- list.files(path = pathname, pattern = NULL, all.files = TRUE, full.names = TRUE, recursive = TRUE, ignore.case = TRUE, include.dirs = FALSE)
    
    ### This gives me a character vector that is equivalent to:
    
    study.files <- c("C:/data/test/test1/test1.txt",
                     "C:/data/test/test2/test2.txt",
                     "C:/data/test/test3/test3.txt")
    
    ### I define my subsets with numeric vectors
    
    women <- c(1,3)
    men <- c(2)
    
    ### This creates new character vectors containing the file paths
    women.files <- study.files[women]
    men.files <- study.files[men]
    
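    ### Read each selected file (one row per line of text) and build the corpus from the result.
    
    # quote = "" and comment.char = "" keep apostrophes and "#" in the text from being misparsed
    nedir <- lapply(women.files, function(filename) read.table(filename, sep = "\t", quote = "", comment.char = "", stringsAsFactors = FALSE))
    # Keep the first column of each table: a character vector holding that file's lines
    hepsi <- lapply(nedir, function(x) x$V1)
    women_corpus <- Corpus(VectorSource(hepsi))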
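
As an alternative to read.table, the same idea can be sketched with readLines, which reads plain text without any table parsing. This is only a sketch under some assumptions: the temporary files stand in for the real study files, read_texts is a hypothetical helper (not part of tm), and the corpus is built only if tm is installed.

```r
# Create a few stand-in files so the sketch runs anywhere (assumption: the
# real study files are plain text, one file per participant).
sample.dir <- file.path(tempdir(), "Samples")
dir.create(sample.dir, showWarnings = FALSE)
for (i in 1:5) {
  writeLines(c(paste("Text by author", i), "It's a second line."),
             file.path(sample.dir, paste0("author", i, ".txt")))
}

study.files <- list.files(sample.dir, pattern = "\\.txt$", full.names = TRUE)

# Subset the file paths with a numeric index vector, as in the question.
women <- c(1, 3)
women.files <- study.files[women]

# Hypothetical helper: read each file and collapse its lines into a single
# string, so that every file becomes exactly one document.
read_texts <- function(paths) {
  texts <- vapply(paths,
                  function(p) paste(readLines(p, warn = FALSE), collapse = "\n"),
                  character(1))
  names(texts) <- basename(paths)
  texts
}

women.texts <- read_texts(women.files)

# Build the corpus only when tm is available.
if (requireNamespace("tm", quietly = TRUE)) {
  women_corpus <- tm::Corpus(tm::VectorSource(women.texts))
}
```

Because the helper takes any character vector of paths, the same call works for each subset, for example read_texts(study.files[men]), however elaborately the index vectors are defined.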