Scan bibtexkeys in Rmarkdown documents

I love the simplicity of Rmarkdown for producing documents, and I maintain my own library in a Bibtex (*.bib) file. I'm using these instructions to cite in the document (the bibtexkey preceded by an "@" symbol).

My question is: Is there a way to scan the Rmarkdown document (*.Rmd) and extract a list of the bibtexkeys cited in it? This would be useful for producing a subset of my library to attach to the project, instead of all of the ca. 6000 references accumulated in my library.

Solution

After exploring several alternatives, I settled on the function str_extract() from the stringr package. Here I assume that you have a bibtex library containing all the cited references (and usually more). I also combined Oto Kaláb's example with one of my own, because the two use different bibtexkey styles.

First, the Rmd document:

rmd_text <- c("# Introduction",
        "",
        "Lorem ipsum dolor sit amet [@bibkey_a], consectetur adipisici elit [@bibkey_b],",
        "sed eiusmod tempor incidunt ut labore et dolore magna aliqua [@bibkey_c;@bibkey_d].",
        "",
        "According to @Noname2000, the world is round [@Ladybug1999;@Ladybug2009].",
        "This knowledge got lost [@Ladybug2009a].")
writeLines(rmd_text, "document.Rmd")

The next code block is commented. At the end we obtain a vector with all cited references, which can be deduplicated with unique().

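# Load stringr, which provides str_extract()
library(stringr)

# Bibtexkeys from bib file
keys <- c("bibkey_a", "bibkey_b", "bibkey_c", "bibkey_d",
        "Noname2000", "Ladybug1999", "Ladybug2009", "Ladybug2009a")
# Prefix with "@" and end at a word boundary, so that "@Ladybug2009"
# does not also match inside "@Ladybug2009a"
keys <- paste0("@", keys, "\\b")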

# Read document
document <- readLines("document.Rmd")

# Scan document line by line
# (str_extract() returns one result per key, NA where the key is absent)
cited_refs <- list()
for(i in seq_along(document)) {
    cited_refs[[i]] <- str_extract(document[i], keys)
}

# Final output
cited_refs <- unlist(cited_refs)
cited_refs <- cited_refs[!is.na(cited_refs)]

summary(as.factor(cited_refs))
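With the cited keys in hand, the subset of the library asked for in the question could be produced along the following lines. This is only a rough sketch in base R: the miniature library, the object names and the output file name are made up for illustration, and it assumes that every entry in the *.bib file starts on its own line in the form "@type{key,". Special entries such as @string or @comment are not handled.

```r
# Hypothetical miniature library standing in for readLines("library.bib")
bib_lines <- c("@article{Noname2000,",
               "  title = {The world is round},",
               "  year = {2000}",
               "}",
               "",
               "@book{Ladybug1999,",
               "  title = {Flat maps},",
               "  year = {1999}",
               "}",
               "",
               "@article{Unused2010,",
               "  title = {Never cited},",
               "  year = {2010}",
               "}")

# Cited keys without the leading "@",
# e.g. obtained with sub("^@", "", unique(cited_refs))
cited_keys <- c("Noname2000", "Ladybug1999")

# Assumption: every entry starts on its own line as "@type{key,"
entry_start <- grepl("^@[A-Za-z]+\\{", bib_lines)
entry_id    <- cumsum(entry_start)   # running entry number for each line
entry_key   <- sub("^@[A-Za-z]+\\{([^,]+),.*$", "\\1", bib_lines[entry_start])

# Keep only the lines that belong to a cited entry
keep <- entry_id > 0 & entry_key[pmax(entry_id, 1)] %in% cited_keys
subset_lines <- bib_lines[keep]

# writeLines(subset_lines, "project.bib")   # hypothetical output file
```

For a real library with irregular formatting, a dedicated bibtex parser would probably be more robust than this line-based approach.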

The cited_refs vector can then be aggregated to get the frequency with which each key appears in the text (which I think is also useful for detecting rarely cited references). I'm also thinking of including the line number of each citation in the output.
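One possible way to get the line numbers is sketched below. It uses base R only and, unlike the code above, does not rely on a list of known keys: it assumes that a key is an "@" followed by letters, digits or underscores, which is a simplification (it would also pick up e-mail addresses, and it misses keys containing other characters). The sample text and the object names are made up for illustration.

```r
# Hypothetical sample text standing in for readLines("document.Rmd")
document <- c("According to @Noname2000, the world is round [@Ladybug1999;@Ladybug2009].",
              "This knowledge got lost [@Ladybug2009a].",
              "See again @Noname2000.")

# Assumed key pattern: "@" followed by letters, digits or underscores
key_pattern <- "@[A-Za-z0-9_]+"

# One character vector of matches per line (empty if the line cites nothing)
matches <- regmatches(document, gregexpr(key_pattern, document))

# Long table: one row per citation, with the line it occurs on
citations <- data.frame(
    line = rep(seq_along(matches), lengths(matches)),
    key  = unlist(matches),
    stringsAsFactors = FALSE
)

# Frequency of each key, rarest first
key_counts <- sort(table(citations$key))
```

Because all matches on a line are collected, a key cited twice on the same line is counted twice here, whereas str_extract() reports only the first match per line.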