Can't get metadata from dataframe using DataframeSource in tm for R

I have a dataframe with the following variables:

doc_id  text  URL  author  date  forum

When I run

samplecorpus <- Corpus(DataframeSource(sampledataframe))

the documentation says I should get a corpus with all of the extra variables added as document-level metadata:
https://rdrr.io/rforge/tm/man/DataframeSource.html
http://finzi.psych.upenn.edu/R/library/tm/html/DataframeSource.html

Instead, I get a corpus that has all of the right documents in the right order, but all of their metadata is blank. I need this metadata to filter the documents for future analysis.

Someone else asked a similar question, but it never got answered: "In tm version a readTabular() replacement tm package DataframeSource () ignores my other columns as metadata"

Does anyone have any ideas on how to fix this?

Thanks!

Solution

Check that everything is loaded correctly. I made an example data.frame, docs, so you can see how it works. I used the same column names you have and added one extra (tags). Comparing your own data against this example should show whether you have an issue somewhere.

library(tm)

# doc_id and text are required by DataframeSource; all other columns become metadata
docs <- data.frame(doc_id = c("doc_1", "doc_2"),
                   text = c("This is a text.", "This is another one."),
                   url = c("https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r",
                           "https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r"),
                   author = c("Emi", "Emi"),
                   date = as.Date(c("2018-09-20", "2018-09-21")),
                   forum = c("stackoverflow", "stackoverflow"),
                   tags = c("r", "tm"),
                   # keep doc_id and text as character vectors, not factors
                   stringsAsFactors = FALSE)

# use Corpus or VCorpus
my_corpus <- Corpus(DataframeSource(docs))
meta(my_corpus)

    url author       date
1 https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r    Emi 2018-09-20
2 https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r    Emi 2018-09-21
          forum tags
1 stackoverflow    r
2 stackoverflow   tm

my_index <- meta(my_corpus, "tags") == "r"

inspect(my_corpus[my_index])
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 5
Content:  documents: 1

          doc_1 
This is a text.
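
If your own corpus still comes back with empty metadata, it may help to check the data frame before handing it to DataframeSource. The sketch below assumes the layout described in the tm documentation (first column named doc_id with unique identifiers, second column named text, any further columns used as metadata). check_tm_dataframe is a hypothetical helper written for this answer, not part of tm, and it only covers the obvious problems.

```r
# Hypothetical helper (not part of tm): returns a character vector of
# problems found in a data frame meant for DataframeSource().
check_tm_dataframe <- function(df) {
  problems <- character(0)

  # The first two columns are expected to be doc_id and text, in that order.
  if (ncol(df) < 2 || !identical(names(df)[1:2], c("doc_id", "text"))) {
    problems <- c(problems, "first two columns must be named doc_id and text")
  }

  # Document identifiers should be unique.
  if ("doc_id" %in% names(df) && anyDuplicated(df$doc_id) > 0) {
    problems <- c(problems, "doc_id contains duplicates")
  }

  # The text column should be character, not factor.
  if ("text" %in% names(df) && !is.character(df$text)) {
    problems <- c(problems, "text is not a character vector")
  }

  # Without extra columns there is nothing to turn into metadata.
  if (ncol(df) <= 2) {
    problems <- c(problems, "no extra columns available as metadata")
  }

  problems
}

good <- data.frame(doc_id = c("doc_1", "doc_2"),
                   text = c("This is a text.", "This is another one."),
                   tags = c("r", "tm"),
                   stringsAsFactors = FALSE)

# Same data, but with the first two columns swapped and text as a factor.
bad <- data.frame(text = c("This is a text.", "This is another one."),
                  doc_id = c("doc_1", "doc_2"),
                  stringsAsFactors = TRUE)

check_tm_dataframe(good)
check_tm_dataframe(bad)
```

An empty result means the data frame passes these checks. If that is the case and the metadata still looks blank, the likely cause is the type of metadata being inspected.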

Beware that tm stores metadata at more than one level, and meta treats each level differently. If you do str(my_corpus) you will see the following:

List of 2
 $ doc_1:List of 2
  ..$ content: chr "This is a text."
  ..$ meta   :List of 7
  .. ..$ author       : chr(0) 
  .. ..$ datetimestamp: POSIXlt[1:1], format: "2018-09-21 08:55:44"
  .. ..$ description  : chr(0) 
  .. ..$ heading      : chr(0) 
  .. ..$ id           : chr "doc_1"
  .. ..$ language     : chr "en"
  .. ..$ origin       : chr(0) 
  .. ..- attr(*, "class")= chr "TextDocumentMeta"
  ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
 $ doc_2:List of 2
......

The meta info you see here is stored with each individual document and comes from meta(my_corpus, type = "local"). The metadata loaded with DataframeSource is of type indexed, which you get with meta(my_corpus, type = "indexed").
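
To see the two kinds of metadata side by side, here is a minimal sketch. It uses a smaller data frame than the example above, and it uses VCorpus because, as far as I know, the simple corpus returned by Corpus() does not keep per-document (local) metadata. The variable names are only for illustration.

```r
library(tm)

docs <- data.frame(doc_id = c("doc_1", "doc_2"),
                   text = c("This is a text.", "This is another one."),
                   author = c("Emi", "Emi"),
                   tags = c("r", "tm"),
                   stringsAsFactors = FALSE)

vc <- VCorpus(DataframeSource(docs))

# Indexed metadata: a data frame with one row per document,
# built from the extra columns (author, tags).
indexed <- meta(vc, type = "indexed")
print(indexed)

# Local metadata: stored inside each document (id, language, ...).
# Note that the local author field is separate from the author column above.
doc_meta <- meta(vc[[1]])
print(doc_meta)
first_id <- meta(vc[[1]], "id")

# Filter the corpus with a plain logical vector taken from the indexed metadata.
keep <- indexed$tags == "r"
filtered <- vc[keep]
length(filtered)
```

Looking only at the local metadata would explain fields such as author appearing empty even though the data frame columns were loaded.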

Page 5 of the vignette is important to read and experiment with to see all the different options that meta and DublinCore offer.