I am collecting author and article information for a search term in PubMed. I can successfully get author names, publication year and other information using entrez_fetch in the rentrez package. The following is my example code:
library(rentrez)
library(XML)
pubmedSearch <- entrez_search("pubmed", term = "flexible ureteroscope", retmax = 100)
SearchResults <- entrez_fetch(db="pubmed", pubmedSearch$ids, rettype="xml", parsed=TRUE)
First_Name <- xpathSApply(SearchResults, "//Author", function(x) {xmlValue(x[["ForeName"]])})
Last_Name <- xpathSApply(SearchResults, "//Author", function(x) {xmlValue(x[["LastName"]])})
PubYear <- xpathSApply(SearchResults, "//PubDate", function(x) {xmlValue(x[["Year"]])})
PMID <- xpathSApply(SearchResults, "//ArticleIdList", function(x) {xmlValue(x[["ArticleId"]])})
Although I get all the information I need, I cannot figure out which authors belong to which PMID, because the number of authors differs from article to article. For example, if I parse author information for 100 articles as in my code, I get more than 100 author names and cannot associate them with their respective PMIDs. Overall, I would like an output data frame like this:
PMID      First_Name  Last_Name         PubYear
28221147  Carlos      Torrecilla Ortiz  2017
28221147  Sergi       Colom Feixas      2017
28208536  Dean G      Assimos           2017
28203551  Chad M      Gridley           2017
28203551  Bodo E      Knudsen           2017
This way, I would know which authors are associated with which PMID, which would be useful for further analysis.
Just to note, this is a small example of my code. I am collecting more information by parsing the XML returned by entrez_fetch in the rentrez package.
This problem is really bugging me and I would really appreciate any help or guidance. Thank you for your efforts and help in advance.
This is really a question about xpath (the language used to specify those nodes in an XML file), which I don't claim to be an expert on. But I think I can help in this case.
You want to make sure that you are extracting information for one PubMed record (PubmedArticle entry) at a time. You can write a function that does that for one record:
parse_paper <- function(paper){
  # The leading "." keeps each search inside this one record
  last_names <- xpathSApply(paper, ".//Author/LastName", xmlValue)
  first_names <- xpathSApply(paper, ".//Author/ForeName", xmlValue)
  pmid <- xpathSApply(paper, ".//ArticleId[@IdType='pubmed']", xmlValue)
  # A single pmid is recycled so that every author row carries it
  data.frame(pmid=pmid, last_names=last_names, first_names=first_names)
}
That should give you one row per author, with the same pmid in each row. We can now extend that to the whole set of results by calling the function on each article:
parse_multiple_papers <- function(papers){
  res <- xpathApply(papers, "/PubmedArticleSet/*", parse_paper)
  do.call(rbind.data.frame, res)
}
head(parse_multiple_papers(SearchResults))
      pmid       last_names first_names
1 28221147 Torrecilla Ortiz      Carlos
2 28221147     Colom Feixas       Sergi
3 28208536          Assimos      Dean G
4 28203551          Gridley      Chad M
5 28203551          Knudsen      Bodo E
6 28101159               Li    Zhi-Gang
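The question also asked for the publication year. In addition, pulling all LastName and ForeName values as two separate vectors can fail when an author entry lacks one of them (a group author, for instance), because the two vectors then have different lengths. Here is a rough sketch of one way to handle both by looping over each Author node. The helper names first_value, parse_paper_full and parse_multiple_papers_full are made up for illustration, the XML is a tiny hand-written stand-in for a real entrez_fetch result, and the element paths reflect my understanding of PubMed's XML layout, so check them against your own records.

```r
library(XML)

# Hand-written stand-in for the parsed entrez_fetch() result (made-up records).
xml_text <- '<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>111</PMID>
      <Article>
        <Journal><JournalIssue><PubDate><Year>2017</Year></PubDate></JournalIssue></Journal>
        <AuthorList>
          <Author><LastName>Smith</LastName><ForeName>Ann</ForeName></Author>
          <Author><CollectiveName>Example Study Group</CollectiveName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>222</PMID>
      <Article>
        <Journal><JournalIssue><PubDate><MedlineDate>2016 Nov-Dec</MedlineDate></PubDate></JournalIssue></Journal>
        <AuthorList>
          <Author><LastName>Lee</LastName><ForeName>Bo</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>'
papers <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: value of the first node matching `path` below `node`,
# or NA when nothing matches.
first_value <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# One row per Author node, so a missing ForeName or LastName becomes NA
# instead of shifting the other names out of alignment.
parse_paper_full <- function(paper) {
  pmid <- first_value(paper, "./MedlineCitation/PMID")
  # Assumption: the year is stored under PubDate/Year; records that only
  # carry a MedlineDate get NA here.
  year <- first_value(paper, "./MedlineCitation/Article/Journal/JournalIssue/PubDate/Year")
  authors <- getNodeSet(paper, "./MedlineCitation/Article/AuthorList/Author")
  if (length(authors) == 0) return(NULL)  # skip records without authors
  data.frame(
    PMID = pmid,
    First_Name = vapply(authors, first_value, character(1), path = "./ForeName"),
    Last_Name = vapply(authors, first_value, character(1), path = "./LastName"),
    PubYear = year,
    stringsAsFactors = FALSE
  )
}

parse_multiple_papers_full <- function(papers) {
  res <- xpathApply(papers, "/PubmedArticleSet/PubmedArticle", parse_paper_full)
  do.call(rbind, Filter(Negate(is.null), res))
}

result <- parse_multiple_papers_full(papers)
print(result)
```

With a real search you would pass SearchResults instead of papers. Group authors show up as rows with NA in both name columns, which you can drop or fill from CollectiveName depending on what the analysis needs.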
BTW, I don't usually search Stack Overflow, but I will answer any questions about rentrez filed as issues at the GitHub repo (they needn't be "bugs" to go there).