I'm using tm in R and dealing with 10k documents. I wanted to inspect some of them by index, but the indices weren't matching the files. Why does tm load large numbers of documents in a weird order, and how can this be fixed or worked around? Here is an example:
library(tm)
# Build 10,000 short test documents: "test 1", "test 2", ...
docs <- paste('test', 1:10000)
cor <- VCorpus(VectorSource(docs))
filepath <- '/home/nate/Dropbox/MSDS/MSDS682_ncg_F8W2_17/test_cor'
writeCorpus(cor, path = filepath)
cor2 <- VCorpus(DirSource(filepath))
as.character(cor2[[1]])
as.character(cor2[[2]])
as.character(cor2[[3]])
as.character(cor2[[4]])
This prints out:
test 10000
test 1000
test 1001
test 1002
This result comes from the filenames created by writeCorpus. In your path you will find files named 1.txt, 10.txt, 100.txt, 1000.txt, 1001.txt ... n.txt. When you read them back in with DirSource, the files are listed in text (character) sort order rather than the numeric order you expected, so the corpus indices no longer line up with the document numbers.
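To make the difference concrete, here is a small base-R sketch using a made-up vector of file names (not the real directory listing). It assumes `method = "radix"` so the comparison is done in the C locale and is reproducible; a locale-aware sort may place names around the `.` slightly differently, but either way the names are compared character by character rather than as numbers.

```r
# Hypothetical file names in the style of the default writeCorpus output
files <- paste0(c(1L, 2L, 10L, 100L, 1000L), ".txt")

# Character sort (C locale via method = "radix"): "2.txt" ends up last
text_sorted <- sort(files, method = "radix")
print(text_sorted)

# Numeric sort: strip the extension, convert to integer, then order
numeric_sorted <- files[order(as.integer(sub("\\.txt$", "", files)))]
print(numeric_sorted)
```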
To keep your sort order as intended, you can add the filenames argument to writeCorpus, for example:
writeCorpus(
  cor,
  path = filepath,
  # Zero-pad to 5 digits so text order matches numeric order
  filenames = paste0(sprintf("%05d", seq_along(cor)), ".txt")
)
This names the output files 00001.txt, 00002.txt, 00003.txt and so on, so the import back from disk will be read in the correct order.
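As a quick sanity check, the sketch below uses base R only (no tm calls) to confirm that zero-padded names sort the same way as text and as numbers. The `width` calculation is an illustrative variation, not part of the original answer: it derives the padding from the number of documents instead of hard-coding 5.

```r
n <- 10000L  # number of documents, as in the example above

# Fixed-width names, as in the writeCorpus call
padded <- sprintf("%05d.txt", seq_len(n))

# A character sort leaves the padded names in their numeric order
padded_sorted <- sort(padded, method = "radix")

# Hypothetical variation: derive the padding width from n
width <- nchar(as.character(n))
padded_auto <- sprintf(paste0("%0", width, "d.txt"), seq_len(n))

print(head(padded, 3))
print(identical(padded_sorted, padded))
```

If the corpus grows past 99,999 documents, a fixed width of 5 would no longer be enough, which is the reason for computing it from `n`.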