I have a dataframe with the following variables:
doc_id text URL author date forum
When I run
samplecorpus <- Corpus(DataframeSource(sampledataframe))
the documentation says I should get a corpus with all of the extra variables added as document-level metadata. https://rdrr.io/rforge/tm/man/DataframeSource.html http://finzi.psych.upenn.edu/R/library/tm/html/DataframeSource.html
Instead, I get a corpus that has all of the right documents in the right order, but all of their metadata is blank. I need this metadata to filter the documents for future analysis.
Other people have asked similar questions, but they never got answered: "In tm version a readTabular() replacement" and "tm package DataframeSource () ignores my other columns as metadata".
Does anyone have any ideas on how to fix this?
Thanks!
First check that everything is loaded correctly. I made an example data.frame, docs, so you can see how it works. I used the same column names you have and added one extra (tags). Comparing your own data frame against this example may show where the issue is.
docs <- data.frame(doc_id = c("doc_1", "doc_2"),
                   text = c("This is a text.", "This another one."),
                   url = c("https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r",
                           "https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r"),
                   author = c("Emi", "Emi"),
                   date = as.Date(c("2018-09-20", "2018-09-21")),
                   forum = c("stackoverflow", "stackoverflow"),
                   tags = c("r", "tm"),
                   # keep text as character rather than factor
                   stringsAsFactors = FALSE)
# use Corpus or VCorpus
my_corpus <- Corpus(DataframeSource(docs))
meta(my_corpus)
url author date
1 https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r Emi 2018-09-20
2 https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r Emi 2018-09-21
forum tags
1 stackoverflow r
2 stackoverflow tm
my_index <- meta(my_corpus, "tags") == "r"
inspect(my_corpus[my_index])
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 5
Content: documents: 1
doc_1
This is a text.
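If your own corpus still comes back without metadata, it may help to check the input data frame before building the corpus. The following is a minimal sketch, not a definitive diagnosis: sampledataframe here is a hypothetical stand-in for your data, and the checks assume the layout described in the DataframeSource documentation (doc_id as the first column, text as the second, everything else treated as metadata).

```r
library(tm)

# Hypothetical stand-in for the asker's data frame (values are made up)
sampledataframe <- data.frame(
  doc_id = c("doc_1", "doc_2"),
  text   = c("This is a text.", "This another one."),
  author = c("Emi", "Emi"),
  stringsAsFactors = FALSE
)

# The documented layout: doc_id first, text second
stopifnot(identical(names(sampledataframe)[1:2], c("doc_id", "text")))

# The text column should hold character strings
stopifnot(is.character(sampledataframe$text))

samplecorpus <- VCorpus(DataframeSource(sampledataframe))

# The remaining columns should show up as document-level (indexed) metadata
meta_cols <- names(meta(samplecorpus, type = "indexed"))
print(meta_cols)
```

If meta_cols comes back empty for your real data, the extra columns were probably lost before the corpus was built, for example by subsetting or renaming the data frame.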
Now beware: there is a difference in how meta is treated. If you run str(my_corpus), you will see the following:
List of 2
$ doc_1:List of 2
..$ content: chr "This is a text."
..$ meta :List of 7
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "2018-09-21 08:55:44"
.. ..$ description : chr(0)
.. ..$ heading : chr(0)
.. ..$ id : chr "doc_1"
.. ..$ language : chr "en"
.. ..$ origin : chr(0)
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
$ doc_2:List of 2
......
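The meta info you see here is from meta(my_corpus, type = "local"). The metadata loaded with DataframeSource is of type indexed, which you get with meta(my_corpus, type = "indexed"). That is why the local author field above is empty (chr(0)) even though the data frame has an author column.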
Page 5 of the vignette is worth reading and experimenting with to see all the options that meta() and DublinCore() offer.
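To make the local versus indexed distinction concrete, here is a small sketch that reads both kinds of metadata and then filters on the indexed kind. It assumes a VCorpus built from a cut-down version of the example data frame; the variable names indexed, local_author, keep and filtered are only illustrative.

```r
library(tm)

docs <- data.frame(
  doc_id = c("doc_1", "doc_2"),
  text   = c("This is a text.", "This another one."),
  author = c("Emi", "Emi"),
  tags   = c("r", "tm"),
  stringsAsFactors = FALSE
)

my_corpus <- VCorpus(DataframeSource(docs))

# Indexed metadata: a data frame with one row per document,
# built from the extra columns of the source data frame
indexed <- meta(my_corpus, type = "indexed")

# Local metadata: stored inside each document; the author tag here
# is not filled from the data frame's author column
local_author <- meta(my_corpus[[1]], "author")

# Filter the corpus with a plain logical vector from the indexed metadata
keep <- indexed$tags == "r"
filtered <- my_corpus[keep]
```

Pulling the column out with $ gives a plain logical vector for subsetting, which is a bit easier to reason about than comparing the one-column data frame that meta(my_corpus, "tags") returns.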