
How to change row names of a DTM when writing to .csv in R


I have a large set of documents stored in a folder. I used these documents for text mining with the tm package. I got all the way to topic modeling and want to write some results to a csv file. However, in the resulting file the document names appear as character(0).

I want to have the names of the documents as they are stored in my folder. This is the code I use (relevant steps shown only):

library(tm)
library(topicmodels)

# directory and k are assumed to be defined earlier in the script
my_corpus <- VCorpus(DirSource(directory, pattern = "\\.pdf$"),
                     readerControl = list(reader = readPDF, language = "dutch"))
dtm <- DocumentTermMatrix(my_corpus)
ldaOut <- LDA(dtm, k, method = "Gibbs")
ldaOut.topics <- as.matrix(topics(ldaOut))
write.csv(ldaOut.topics, file = paste("LDAGibbs", k, "CorpusToTopics.csv"))

I can't seem to find the answer anywhere. I assume there is a basic R function for this that I don't know about.


Solution

  • Weird how you lose the document names. I can't reproduce this error, and I have a lot of different folders with PDFs and loads of different naming conventions.

    Check the result of dtm$dimnames$Docs right after you create the dtm. If it returns character(0), you can do the following to put the document names into the document-term matrix.

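    # Escape the dot and anchor the pattern so only names ending in .pdf match
    pdf_names <- list.files(directory, pattern = "\\.pdf$")
    stopifnot(length(pdf_names) == nrow(dtm))  # one name per document
    dtm$dimnames$Docs <- pdf_names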
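
If the names are already missing by the time you reach the topics matrix, a similar fix can probably be applied there, just before writing the file. The sketch below is not the original pipeline. It uses base R only, with a small made-up matrix standing in for as.matrix(topics(ldaOut)) and hypothetical file names standing in for the output of list.files(). It also assumes the files are listed in the same order as the documents in the matrix.

```r
# Hypothetical stand-in for as.matrix(topics(ldaOut)):
# one assigned topic per document, with the row names missing.
doc_topics <- matrix(c(2L, 1L, 3L), ncol = 1)

# Hypothetical file names; in practice these would come from
# list.files(directory, pattern = "\\.pdf$").
pdf_names <- c("report_a.pdf", "report_b.pdf", "report_c.pdf")

# Stop early if the number of files and documents differ,
# since the names would then be attached to the wrong rows.
stopifnot(length(pdf_names) == nrow(doc_topics))

# Attach the document names as row names and label the single column.
rownames(doc_topics) <- pdf_names
colnames(doc_topics) <- "topic"

# write.csv writes row names as the first column by default.
out_file <- tempfile(fileext = ".csv")
write.csv(doc_topics, file = out_file)

# Read the file back to check that the document names were written.
result <- read.csv(out_file, row.names = 1)
print(result)
```

Setting the names on the dtm itself, as above, is usually the cleaner option because every later step inherits them. Naming the rows of the topics matrix only fixes the exported file.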