r spacy named-entity-recognition

CleanNLP package in R: metadata data frame?


Let's assume my dataframe looks like this:

bio_text <- c("Georg Aemilius, eigentlich Georg Oemler, andere Namensvariationen „Aemylius“ und „Emilius“ (* 25. Juni 1517 in Mansfeld; † 22. Mai 1569 in Stolberg (Harz))...", "Johannes Aepinus auch: Johann Hoeck, Huck, Hugk, Hoch oder Äpinus (* um 1499 in Ziesar; † 13. Mai 1553 in Hamburg) war ein deutscher evangelischer Theologe und Reformator.\nAepinus wurde als Sohn des Ratsherrn Hans Hoeck im brandenburgischen Ziesar 1499 geboren...")
doc_id <- c("1", "2")
url <- c("https://de.wikipedia.org/wiki/Georg_Aemilius", "https://de.wikipedia.org/wiki/Johannes_Aepinus")
name <- c("Aemilius, Georg", "Aepinus, Johannes")
place_of_birth <- c("Mansfeld", "Ziesar")

full_wikidata <- data.frame(bio_text, doc_id, url, name, place_of_birth)

I want to carry out named entity recognition with the cleanNLP package in R. To do that, I initialize the tokenizers and the spaCy backend, and everything works fine:

options(stringsAsFactors = FALSE)
library(cleanNLP)

cnlp_init_tokenizers()

require(reticulate)
cnlp_init_spacy("de")

wikidata <- full_wikidata[,c("doc_id", "bio_text")]
wikimeta <- full_wikidata[,c("url", "name", "place_of_birth")]

spacy_annotatedWikidata <- cleanNLP::cnlp_annotate(wikidata, as_strings = TRUE, meta = wikimeta)

My only problem is the metadata. When I run it like this, I get the following warning message: In cleanNLP::cnlp_annotate(full_wikidata, as_strings = TRUE, meta = wikimeta) : data frame input given along with meta; ignoring the latter. I don't understand what the documentation of cnlp_annotate says about meta: "an optional data frame to bind to the document table". Doesn't this mean that I should pass a data frame containing the metadata? Later on, I want to be able to do something like this, e.g. count the person entities in document no. 3:

library(dplyr)  # provides %>%, filter() and count()

cnlp_get_entity(spacy_annotatedWikidata) %>%
  filter(doc_id == 3, entity_type == "PER") %>%
  count(entity)

Therefore, I have to find a way to access the metadata. Any help would be highly appreciated!


Solution

  • Fortunately, I got some help in the meantime, along with the advice to take a closer look at the source code of cnlp_annotate on GitHub: https://github.com/statsmaths/cleanNLP/blob/master/R/annotate.R It shows that you can only pass in a metadata data frame if the input itself is not a data frame but a file path. If you do want to pass in a data frame, the first column has to be doc_id and the second the text; the remaining columns are automatically treated as metadata. So in my example, only the column order in full_wikidata has to be changed:

    full_wikidata <- data.frame(doc_id, bio_text, url, name, place_of_birth)
    

    With this column order, it can be passed directly to cnlp_annotate:

    spacy_annotatedWikidata <- cleanNLP::cnlp_annotate(full_wikidata, as_strings = TRUE)
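If the data frame already exists and you would rather not rebuild it, the columns can also be reordered in place. The following is only a minimal sketch in base R: it uses shortened placeholder texts instead of the real biographies, and the helper names front and reordered are made up for illustration. It does not call cleanNLP at all, it only shows how to move doc_id and bio_text to the front.

```r
# Toy version of the data frame from the question, with the text column first
# (placeholder texts, not the real biographies)
full_wikidata <- data.frame(
  bio_text = c("Georg Aemilius ...", "Johannes Aepinus ..."),
  doc_id = c("1", "2"),
  url = c("https://de.wikipedia.org/wiki/Georg_Aemilius",
          "https://de.wikipedia.org/wiki/Johannes_Aepinus"),
  name = c("Aemilius, Georg", "Aepinus, Johannes"),
  place_of_birth = c("Mansfeld", "Ziesar"),
  stringsAsFactors = FALSE
)

# Columns that must come first: the document id, then the text
front <- c("doc_id", "bio_text")

# Put those two first and keep every other column, in its original order, after them
reordered <- full_wikidata[, c(front, setdiff(names(full_wikidata), front))]

print(names(reordered))
```

The reordered data frame could then be used in place of full_wikidata in the cnlp_annotate call above. The advantage over rebuilding with data.frame() is that any additional metadata columns are carried along without having to be listed by hand.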