
Can't get metadata from dataframe using DataframeSource in tm for R


I have a dataframe with the following variables:

doc_id  text  URL  author  date  forum 

When I run

samplecorpus <- Corpus(DataframeSource(sampledataframe))

the documentation says I should get a corpus with all of the extra variables added as document-level metadata:

https://rdrr.io/rforge/tm/man/DataframeSource.html
http://finzi.psych.upenn.edu/R/library/tm/html/DataframeSource.html

Instead, I get a corpus that has all of the right documents in the right order, but all of their metadata is blank. I need this metadata to filter the documents for future analysis.

Someone else asked a similar question, but it never got answered: "In tm version a readTabular() replacement tm package DataframeSource () ignores my other columns as metadata"

Does anyone have any ideas on how to fix this?

Thanks!


Solution

  • Check that everything is loaded correctly. I made an example docs data.frame so you can see how it works. It uses the same column names you have, plus one extra (tags). Comparing your own data against this example may show where the issue is.

    library(tm)

    docs <- data.frame(doc_id = c("doc_1", "doc_2"),
                       text = c("This is a text.", "This another one."),
                       url = c("https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r",
                               "https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r"), 
                       author = c("Emi", "Emi"),
                       date = as.Date(c("2018-09-20", "2018-09-21")),
                       forum = c("stackoverflow", "stackoverflow"),
                       tags = c("r", "tm"),
                       stringsAsFactors = FALSE)
    
    # use Corpus or VCorpus
    my_corpus <- Corpus(DataframeSource(docs))
    meta(my_corpus)
    
        url author       date
    1 https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r    Emi 2018-09-20
    2 https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r    Emi 2018-09-21
              forum tags
    1 stackoverflow    r
    2 stackoverflow   tm
    
    my_index <- meta(my_corpus, "tags") == "r"
    
    inspect(my_corpus[my_index])
    <<SimpleCorpus>>
    Metadata:  corpus specific: 1, document level (indexed): 5
    Content:  documents: 1
    
              doc_1 
    This is a text. 
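
Before building the corpus from your own data, it can help to confirm that the data frame has the layout the DataframeSource documentation describes: a first column named doc_id with unique identifiers, a second column named text, and any further columns treated as document-level metadata. The sketch below uses only base R. The helper name check_tm_dataframe and the sample values are assumptions made for illustration; they are not part of tm or of the original question.

```r
# Hypothetical helper (not part of tm): checks the column layout that
# DataframeSource documents and returns the names of the extra columns.
check_tm_dataframe <- function(df) {
  stopifnot(is.data.frame(df), ncol(df) >= 2)
  if (!identical(names(df)[1:2], c("doc_id", "text"))) {
    stop("the first two columns must be named 'doc_id' and 'text'")
  }
  if (anyDuplicated(df$doc_id) > 0) {
    stop("'doc_id' values must be unique")
  }
  # Everything besides doc_id and text is a candidate for indexed metadata.
  setdiff(names(df), c("doc_id", "text"))
}

# Made-up sample data using some of the column names from the question.
sampledataframe <- data.frame(
  doc_id = c("doc_1", "doc_2"),
  text = c("First post.", "Second post."),
  URL = c("https://example.com/1", "https://example.com/2"),
  author = c("a", "b"),
  stringsAsFactors = FALSE
)

extra_columns <- check_tm_dataframe(sampledataframe)
print(extra_columns)
```

If the helper stops with an error, reordering or renaming the columns is probably the first thing to try.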
    

    Be aware that tm stores metadata at more than one level, and the levels are accessed differently. If you run str(my_corpus), you will see the following:

    List of 2
     $ doc_1:List of 2
      ..$ content: chr "This is a text."
      ..$ meta   :List of 7
      .. ..$ author       : chr(0) 
      .. ..$ datetimestamp: POSIXlt[1:1], format: "2018-09-21 08:55:44"
      .. ..$ description  : chr(0) 
      .. ..$ heading      : chr(0) 
      .. ..$ id           : chr "doc_1"
      .. ..$ language     : chr "en"
      .. ..$ origin       : chr(0) 
      .. ..- attr(*, "class")= chr "TextDocumentMeta"
      ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
     $ doc_2:List of 2
    ......
    

    The meta information you see here is the per-document metadata, which you get with meta(my_corpus, type = "local"). The metadata loaded by DataframeSource is of type indexed, which you get with meta(my_corpus, type = "indexed").

    Page 5 of the vignette is worth reading and experimenting with to see the different options that meta and DublinCore offer.
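
As a rough illustration of the local versus indexed distinction, here is a minimal sketch. It assumes a tm version in which DataframeSource turns the extra columns into document-level metadata, and it uses a VCorpus because, as far as I know, that is the corpus class whose meta() method accepts type = "local". The trimmed-down docs data frame and the variable names are made up for the example.

```r
library(tm)

# Smaller version of the example data frame (illustrative values only).
docs <- data.frame(
  doc_id = c("doc_1", "doc_2"),
  text = c("This is a text.", "This another one."),
  author = c("Emi", "Emi"),
  tags = c("r", "tm"),
  stringsAsFactors = FALSE
)

vc <- VCorpus(DataframeSource(docs))

# Indexed metadata: a data frame with one row per document,
# built from the extra data frame columns.
indexed_meta <- meta(vc, type = "indexed")
print(names(indexed_meta))

# Local metadata: stored inside each document (id, language, and so on).
first_doc_id <- meta(vc[[1]], "id")
print(first_doc_id)

# Filter on an indexed column; [[1]] extracts it as a plain vector.
keep <- meta(vc, "tags")[[1]] == "r"
r_docs <- vc[keep]
print(length(r_docs))
```

Pulling the column out with [[1]] gives a plain logical vector for subsetting, which is a little easier to reason about than comparing the one-column data frame directly.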