
How to fix the tm package loading large amounts of documents in strange order?


I'm using tm in R and dealing with 10k documents. I wanted to inspect some by index, but the indices didn't match the files. Why does tm load a large number of documents in a weird order, and how can it be fixed or worked around? Here is an example:

library(tm)

# Build the 10,000 test strings in one vectorized call
docs <- paste('test', 1:10000)

cor <- VCorpus(VectorSource(docs))

filepath = '/home/nate/Dropbox/MSDS/MSDS682_ncg_F8W2_17/test_cor'
writeCorpus(cor, path = filepath)

cor2 <- VCorpus(DirSource(filepath))

as.character(cor2[[1]])
as.character(cor2[[2]])
as.character(cor2[[3]])
as.character(cor2[[4]])

This prints out:

test 10000
test 1000
test 1001
test 1002

Solution

  • This result comes from the filenames that writeCorpus creates. In your path you will find files that, listed in text order, are named 1.txt, 10.txt, 100.txt, 1000.txt, 1001.txt, and so on.

    When you read them back in with DirSource, they come in using that text (character-by-character) sort instead of the numeric order you expected, so 1000.txt is read before 2.txt.

    To keep the sort order you intended, pass zero-padded names through the filenames argument of writeCorpus, for example:

    writeCorpus(
      cor,
      path = filepath,
      # Zero-pad each index to 5 digits so text order matches numeric order
      filenames = paste0(sprintf("%05d", seq_along(cor)), ".txt")
    )
    

    This writes the files as 00001.txt, 00002.txt, 00003.txt, ..., 10000.txt, so when you import them back from disk they are read in the correct order.
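
To see the effect without touching the disk, here is a minimal base R sketch of the two naming schemes. The `ids` vector is a hypothetical stand-in for the document indices, and `method = "radix"` is used only so the result does not depend on your system's collation locale (the exact position of unpadded names such as 10000.txt can vary by locale). It also derives the padding width from the largest index rather than hard-coding 5, which is an assumption about what you might want, not something writeCorpus does for you.

```r
# Hypothetical indices standing in for the documents in the corpus
ids <- c(1L, 2L, 10L, 100L, 1000L, 10000L)

# Unpadded names, as in the default output: a text sort compares
# character by character, so "10.txt" sorts before "2.txt"
plain <- paste0(ids, ".txt")
plain_sorted <- sort(plain, method = "radix")  # radix sorts in the C locale

# Zero-padded names: take the width from the largest index
width <- nchar(max(ids))                       # 5 for 10000L
padded <- sprintf(paste0("%0", width, "d.txt"), ids)
padded_sorted <- sort(padded, method = "radix")

print(plain_sorted)   # numeric order is lost
print(padded_sorted)  # numeric order is preserved
```

With padded names the text order and the numeric order coincide, which is why the corpus comes back from DirSource in the order it was written.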