Tags: r, xml, xpath, pubmed, ropensci

PubMed XML parsing using entrez_fetch in rentrez


I am collecting author and article information for a search term in PubMed. I can successfully retrieve author names, publication years and other information using entrez_fetch in the rentrez package. Here is my example code:

library(rentrez)
library(XML)

pubmedSearch <- entrez_search("pubmed", term = "flexible ureteroscope", retmax = 100)
SearchResults <- entrez_fetch(db="pubmed", pubmedSearch$ids, rettype="xml", parsed=TRUE)
First_Name <- xpathSApply(SearchResults, "//Author", function(x) {xmlValue(x[["ForeName"]])})
Last_Name <- xpathSApply(SearchResults, "//Author", function(x) {xmlValue(x[["LastName"]])})
PubYear <- xpathSApply(SearchResults, "//PubDate", function(x) {xmlValue(x[["Year"]])})
PMID <- xpathSApply(SearchResults, "//ArticleIdList", function(x) {xmlValue(x[["ArticleId"]])})

Although I get all the information I need, I cannot figure out which authors belong to which PMID, because the number of authors differs from one PMID to the next. For example, if I parse author information for 100 articles as in my code, I get more than 100 author names and cannot associate them with their respective PMIDs. Overall, I would like an output data frame like this:

 PMID       First_Name   Last_Name          PubYear
 28221147   Carlos      Torrecilla Ortiz    2017
 28221147   Sergi       Colom Feixas        2017
 28208536   Dean G      Assimos             2017
 28203551   Chad M      Gridley             2017
 28203551   Bodo E      Knudsen             2017

That way, I would know which authors are associated with which PMID, which would be useful for further analysis.

Note that this is only a small example of my code. I am collecting more information by parsing the XML returned by entrez_fetch in the rentrez package.

This problem is really bugging me and I would really appreciate any help or guidance. Thank you for your efforts and help in advance.


Solution

  • This is really a question about XPath (the language used to select nodes in an XML document), which I don't claim to be an expert on. But I think I can help in this case.

    You want to make sure that you are extracting information for one PubMed record (a PubmedArticle entry) at a time. You can write a function that does that for one record:

    parse_paper <- function(paper){
      # Each XPath starts with "." so it only searches inside this one record
      last_names <- xpathSApply(paper, ".//Author/LastName", xmlValue)
      first_names <- xpathSApply(paper, ".//Author/ForeName", xmlValue)
      pmid <- xpathSApply(paper, ".//ArticleId[@IdType='pubmed']", xmlValue)
      # data.frame() recycles the record's pmid across its author rows
      data.frame(pmid=pmid, last_names=last_names, first_names=first_names)
    }
    

    That should give you one row per author, with the same pmid in each row. We can now extend that to the whole result set by calling that function on each article:

    parse_multiple_papers <- function(papers){
      # Apply parse_paper to each record under the root, then stack the data frames
      res <- xpathApply(papers, "/PubmedArticleSet/*", parse_paper)
      do.call(rbind.data.frame, res)
    }
    
    head(parse_multiple_papers(SearchResults))
    

    Output:

          pmid       last_names first_names
    1 28221147 Torrecilla Ortiz      Carlos
    2 28221147     Colom Feixas       Sergi
    3 28208536          Assimos      Dean G
    4 28203551          Gridley      Chad M
    5 28203551          Knudsen      Bodo E
    6 28101159               Li    Zhi-Gang
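
The leading `.` in the XPath expressions inside parse_paper is what keeps each query within a single record. As a small offline sketch of that point, the code below compares a relative and an absolute path evaluated from the first record. The two-record XML string is made up for illustration and is far simpler than real PubMed output.

```r
library(XML)

# Hypothetical, heavily trimmed stand-in for a PubMed result set
xml_text <- '<PubmedArticleSet>
  <PubmedArticle>
    <AuthorList>
      <Author><LastName>Smith</LastName><ForeName>Ann</ForeName></Author>
      <Author><LastName>Jones</LastName><ForeName>Bob</ForeName></Author>
    </AuthorList>
  </PubmedArticle>
  <PubmedArticle>
    <AuthorList>
      <Author><LastName>Lee</LastName><ForeName>Cy</ForeName></Author>
    </AuthorList>
  </PubmedArticle>
</PubmedArticleSet>'

doc <- xmlParse(xml_text, asText = TRUE)
first_record <- getNodeSet(doc, "/PubmedArticleSet/PubmedArticle")[[1]]

# Relative path: searches only below first_record
within_record <- xpathSApply(first_record, ".//Author/LastName", xmlValue)

# Absolute path: starts again from the document root, so it sees every record
whole_document <- xpathSApply(first_record, "//Author/LastName", xmlValue)

print(within_record)
print(whole_document)
```

The relative query should return only the first record's two last names, while the absolute one should return the names from both records. That is the same mixing of authors across PMIDs that the question describes.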
    

    BTW, I don't usually search Stack Overflow, but I will answer any questions about rentrez filed as issues at the GitHub repo (they needn't be "bugs" to go there).
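
To also get the PubYear column from the question, and to cope with authors that have no ForeName or LastName (group authors listed under CollectiveName, for example), one option is to loop over the Author nodes of each record so that every field stays aligned. This is only a sketch. `first_value` and `parse_paper_with_year` are names made up here, the inline XML is a made-up and trimmed record set, and the code assumes the year sits in `PubDate/Year`, which does not hold for every PubMed record (some carry a `MedlineDate` instead).

```r
library(XML)

# Return the first match of a relative XPath under `node`, or NA if there is none
first_value <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# One row per author, with the record-level PMID and year repeated on each row
parse_paper_with_year <- function(paper) {
  authors <- getNodeSet(paper, ".//AuthorList/Author")
  if (length(authors) == 0) return(NULL)  # no authors: nothing to return
  data.frame(
    PMID       = first_value(paper, ".//ArticleId[@IdType='pubmed']"),
    First_Name = vapply(authors, first_value, character(1), path = "./ForeName"),
    Last_Name  = vapply(authors, first_value, character(1), path = "./LastName"),
    PubYear    = first_value(paper, ".//PubDate/Year"),
    stringsAsFactors = FALSE
  )
}

# Hypothetical, trimmed records used only to exercise the function offline
xml_text <- '<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation><Article>
      <Journal><JournalIssue><PubDate><Year>2017</Year></PubDate></JournalIssue></Journal>
      <AuthorList>
        <Author><LastName>Smith</LastName><ForeName>Ann</ForeName></Author>
        <Author><CollectiveName>Example Study Group</CollectiveName></Author>
      </AuthorList>
    </Article></MedlineCitation>
    <PubmedData><ArticleIdList>
      <ArticleId IdType="pubmed">11111111</ArticleId>
    </ArticleIdList></PubmedData>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation><Article>
      <Journal><JournalIssue><PubDate><Year>2016</Year></PubDate></JournalIssue></Journal>
      <AuthorList>
        <Author><LastName>Jones</LastName><ForeName>Bob</ForeName></Author>
      </AuthorList>
    </Article></MedlineCitation>
    <PubmedData><ArticleIdList>
      <ArticleId IdType="pubmed">22222222</ArticleId>
    </ArticleIdList></PubmedData>
  </PubmedArticle>
</PubmedArticleSet>'

doc <- xmlParse(xml_text, asText = TRUE)
papers <- getNodeSet(doc, "/PubmedArticleSet/PubmedArticle")
result <- do.call(rbind, lapply(papers, parse_paper_with_year))
print(result)
```

Looking up each field per Author node, rather than collecting all first names and all last names separately, means a missing name becomes an NA in the right row instead of shortening one vector and breaking data.frame(). With real data, the same function could be applied to the parsed entrez_fetch result in place of `doc`.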