I have a dataframe with the following variables:
doc_id text URL author date forum
When I run
samplecorpus <- Corpus(DataframeSource(sampledataframe))
the documentation says I should get a corpus with all of the extra variables added as document-level metadata. https://rdrr.io/rforge/tm/man/DataframeSource.html http://finzi.psych.upenn.edu/R/library/tm/html/DataframeSource.html
Instead, I get a corpus that has all of the right documents in the right order, but all of their metadata is blank. I need this metadata to filter the documents for future analysis.
Someone else asked a similar question, but it never got answered: "In tm version a readTabular() replacement" and "tm package DataframeSource () ignores my other columns as metadata".
Does anyone have any ideas on how to fix this?
You have to check whether everything is loaded correctly. I made an example docs data.frame so you can see how it works. I used the same column names you have and added one extra column (tags). Comparing your data against this example should show whether you have an issue somewhere.
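library(tm)

docs <- data.frame(doc_id = c("doc_1", "doc_2"),
                   text = c("This is a text.", "This another one."),
                   # one URL per document (the same URL is used for both rows)
                   url = c("https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r",
                           "https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r"),
                   author = c("Emi", "Emi"),
                   date = as.Date(c("2018-09-20", "2018-09-21")),
                   forum = c("stackoverflow", "stackoverflow"),
                   tags = c("r", "tm"),
                   # keep text columns as character vectors, not factors
                   stringsAsFactors = FALSE)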
# use Corpus or VCorpus
my_corpus <- Corpus(DataframeSource(docs))
# print the document-level metadata
meta(my_corpus)
url author date
1 https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r Emi 2018-09-20
2 https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r Emi 2018-09-21
forum tags
1 stackoverflow r
2 stackoverflow tm
# select the documents tagged "r" and inspect the result
my_index <- meta(my_corpus, "tags") == "r"
inspect(my_corpus[my_index])
Metadata: corpus specific: 1, document level (indexed): 5
Content: documents: 1
This is a text.
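To run the same check on your own data, a minimal sketch along these lines may help. It assumes your data frame looks roughly like the stand-in below (the values are made up for illustration), and it relies on the requirement in the DataframeSource documentation that the first two columns are named doc_id and text.

```r
library(tm)

# Small stand-in for the real data (values are made up for illustration)
sampledataframe <- data.frame(
  doc_id = c("doc_1", "doc_2"),
  text   = c("This is a text.", "This another one."),
  author = c("Emi", "Emi"),
  forum  = c("stackoverflow", "stackoverflow"),
  stringsAsFactors = FALSE
)

# The first two columns must be doc_id and text, in that order
stopifnot(identical(names(sampledataframe)[1:2], c("doc_id", "text")))

samplecorpus <- Corpus(DataframeSource(sampledataframe))

# The remaining columns should come back as indexed metadata
corpus_meta <- meta(samplecorpus)

# Columns that did not make it into the metadata (ideally none)
missing_cols <- setdiff(names(sampledataframe),
                        c("doc_id", "text", names(corpus_meta)))

# Filter the documents on a metadata column
emi_docs <- samplecorpus[corpus_meta$author == "Emi"]
```

If missing_cols is not empty, or corpus_meta has no columns at all, the problem is probably in how the data frame or the installed tm version handles the extra columns rather than in the filtering step.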
Now beware that there is a difference in how meta is treated. If you run str(my_corpus), you will see the following:
List of 2
$ doc_1:List of 2
..$ content: chr "This is a text."
..$ meta :List of 7
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "2018-09-21 08:55:44"
.. ..$ description : chr(0)
.. ..$ heading : chr(0)
.. ..$ id : chr "doc_1"
.. ..$ language : chr "en"
.. ..$ origin : chr(0)
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
$ doc_2:List of 2
The meta info you see here is what meta(my_corpus, type = "local") returns. The metadata loaded with DataframeSource is of type indexed, which you get with meta(my_corpus, type = "indexed").
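The difference between the two kinds of metadata can be seen side by side in a short sketch. It assumes a VCorpus (so that each document is a PlainTextDocument carrying its own local metadata) and a cut-down version of the example data frame. The variable names indexed_meta, local_meta and first_id are only illustrative.

```r
library(tm)

docs <- data.frame(doc_id = c("doc_1", "doc_2"),
                   text = c("This is a text.", "This another one."),
                   tags = c("r", "tm"),
                   stringsAsFactors = FALSE)

# VCorpus stores every document as a PlainTextDocument with local metadata
my_corpus <- VCorpus(DataframeSource(docs))

# Indexed (document-level) metadata: a data frame built from the extra columns
indexed_meta <- meta(my_corpus, type = "indexed")

# Local metadata of the first document: author, datetimestamp, id, language, ...
local_meta <- meta(my_corpus[[1]])

# A single local tag of the first document
first_id <- meta(my_corpus[[1]], "id")
```

Here indexed_meta holds the tags column from the data frame, while local_meta only holds the default fields that str() prints, which is why the local fields look empty even though the metadata was loaded.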
Page 5 of the vignette is worth reading and experimenting with to see all the options that meta and DublinCore offer.