Tags: r, for-loop, text-mining, tm, corpus

Store multiple corpora via a for loop under different names


I have multiple text documents per ticker, and I want to store each ticker's documents as a separate corpus. I've read about creating "lists in lists", but this doesn't work for me. For example, following "text mining and termdocumentmatrix" gives this error: no applicable method for 'TermDocumentMatrix' applied to an object of class "list".

I could possibly put everything within the for loop, but that's not what I want since I want some flexibility to play with the corpus.

Could someone help me work around this problem effectively? My code is below. Thank you in advance!

Stocks <- list("AAPL", "AMZN", "BIG", "BYD", "CTWS", "EAT", "FB", "GOOG", "GRMC", "HRL", "MGM", "MSFT",
               "NEM", "PKS", "RGLD", "SCCO", "SLP", "TCO", "USGL", "WDFC"
)

BigList <- list()
for (stock in Stocks) {
  filepath <- file.path("C:/Users/......./Stocks10K", stock)
  a <- Corpus(DirSource(filepath))
  a <- tm_map(a, removePunctuation)
  a <- tm_map(a, removeNumbers)
  a <- tm_map(a, content_transformer(tolower))
  a <- tm_map(a, removeWords, stopwords("en"))
  a <- tm_map(a, stripWhitespace)
  name <- paste('Data:', stock, sep='')
  tmp <- list(Text = a)
  BigList[name] <- tmp
  rm(tmp, stock, name, filepath, a)
}

#Create Term Document Matrix and create Matrix
tdm <- TermDocumentMatrix(BigList['Data:AAPL'])
m <- as.matrix(tdm)

Solution

  • It looks like you've done everything right, except for how you get your entry out of BigList. Single brackets, [, return a list (containing one element in your case), and TermDocumentMatrix has no method for a plain list. Use [[ to get the corpus itself:

    tdm <- TermDocumentMatrix(BigList[['Data:AAPL']])

    https://cran.r-project.org/doc/manuals/R-lang.html#Indexing has more info, including this note (in case what I said above isn't clear):

    For lists, one generally uses [[ to select any single element, whereas [ returns a list of the selected elements.
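
To make the difference concrete, here is a minimal sketch. The names big_list, single_bracket, double_bracket and corpora are made up for illustration, and the toy strings stand in for the real 10-K files, which I don't have. The second half only runs if the tm package is installed, and it builds the corpus in memory with VCorpus(VectorSource(...)) rather than from DirSource as in the question.

```r
# Toy stand-in for BigList: a named list of character vectors
# (hypothetical data; in the question the elements are tm corpora).
big_list <- list(
  "Data:AAPL" = c("doc one", "doc two"),
  "Data:AMZN" = c("doc three")
)

single_bracket <- big_list["Data:AAPL"]    # still a list, of length 1
double_bracket <- big_list[["Data:AAPL"]]  # the stored element itself

class(single_bracket)  # "list"
class(double_bracket)  # "character"

# Same idea with real corpora, assuming tm is available.
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  corpora <- list(
    "Data:AAPL" = VCorpus(VectorSource(c("apple sells phones",
                                         "apple sells laptops")))
  )
  # [[ hands TermDocumentMatrix a corpus, not a list wrapping one
  tdm <- TermDocumentMatrix(corpora[["Data:AAPL"]])
  m <- as.matrix(tdm)
}
```

If you later want a matrix for every ticker, lapply(BigList, TermDocumentMatrix) should also work, since lapply passes each element (not a one-element list) to the function.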