Tags: r, xml, xpath, pubmed, rentrez

Calculating the number of XML children under each parent node for a list in R


I am querying PubMed with a long list of PMIDs using R. Because entrez_fetch can only handle a limited number of IDs per request, I have split my ~2000 PMIDs into a list of several vectors (each about 500 in length). When I query PubMed, I extract information from the XML returned for each publication. What I would like to end up with is something like this:

    Original.PMID     Publication.type
    26956987          Journal.article
    26956987          Meta.analysis
    26956987          Multicenter.study
    26402000          Journal.article
    25404043          Journal.article
    25404043          Meta.analysis

Each publication has a unique PMID, but there may be several publication types associated with each PMID (as seen above). I can extract the PMID from the XML, and I can get the publication types for each PMID. My problem is repeating each PMID the right number of times so that it is paired with every publication type it has. I can do this when my data is not in a list with multiple sublists (e.g., if I have 14 batches, each as its own data frame) by getting the number of child nodes under the parent PublicationTypeList node. But I can't figure out how to do this within a list.

My code so far is this:

    library(rvest)
    library(tidyverse)
    library(stringr)
    library(regexr)
    library(rentrez)
    library(XML)

    pubmed.fwd <- my.data.frame

    # Split a vector into n roughly equal batches
    into.batches <- function(x, n) split(x, cut(seq_along(x), n, labels = FALSE))
    batches <- into.batches(pubmed.fwd$PMID, 14)
    names(batches) <- paste0("Batch", 1:14)

    # Fetch and parse one XML document per batch
    fwd <- sapply(batches, function(x) entrez_fetch(db = "pubmed", id = x, rettype = "xml", parsed = TRUE))

    # Pull publication types and PMIDs out of each batch
    trial1 <- lapply(fwd, function(x)
      list(pub.type = xpathSApply(x, "//PublicationTypeList/PublicationType", xmlValue),
           or.pmid = xpathSApply(x, "//ArticleId[@IdType='pubmed']", xmlValue)))

trial1 is what I am having problems with. It gives me a list in which each batch has a vector for pub.type and a vector for or.pmid, but the two vectors have different lengths.

I am trying to work out how many child publication types there are for each publication, so that I can repeat the PMID that many times. I am currently using the following code, which does not do what I want:

    trial1 <- lapply(fwd, function(x)
      list(childnodes = xpathSApply(xmlRoot(x), "count(.//PublicationTypeList/PublicationType)", xmlChildren)))

Unfortunately, this just tells me the total number of child nodes for each batch, not for each publication (or PMID).


Solution

  • I would split the XML results into separate PubmedArticle nodes and apply XPath functions to each node to get the PMIDs and publication types.

    library(rentrez)
    library(XML)

    pmids <- c(11677608, 22328765, 11337471)
    res <- entrez_fetch(db="pubmed", rettype="xml", id = pmids)
    doc <- xmlParse(res)
    x <-  getNodeSet(doc, "//PubmedArticle")
    x1 <- sapply(x, xpathSApply, ".//ArticleId[@IdType='pubmed']", xmlValue)
    x2 <- sapply(x, xpathSApply, ".//PublicationType", xmlValue)
    data.frame( pmid= rep(x1, sapply(x2, length) ), pubtype = unlist(x2) )
          pmid                          pubtype
    1 11677608                  Journal Article
    2 11677608 Research Support, Non-U.S. Gov't
    3 22328765                  Journal Article
    4 22328765 Research Support, Non-U.S. Gov't
    5 11337471                  Journal Article
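
To apply the same per-article split to a list of batches like the one in the question, one option is to wrap the steps in a function and row-bind the results. The following is a minimal offline sketch, not a tested drop-in: `pubtypes_long`, the two tiny XML strings, and the PMIDs 111, 222 and 333 are made up for illustration, and it assumes each list element is an already parsed document. Taking the first matching ArticleId is a defensive assumption in case a record contains more than one such node.

```r
library(XML)

# Hypothetical helper: one row per (PMID, publication type) pair for one parsed batch
pubtypes_long <- function(doc) {
  articles <- getNodeSet(doc, "//PubmedArticle")
  # First matching ArticleId per article (assumption: that one is the article's own PMID)
  pmid <- vapply(articles, function(a)
    xpathSApply(a, ".//ArticleId[@IdType='pubmed']", xmlValue)[1], character(1))
  types <- lapply(articles, function(a) xpathSApply(a, ".//PublicationType", xmlValue))
  # lengths(types) is the number of PublicationType children per article
  data.frame(pmid = rep(pmid, lengths(types)),
             pubtype = unlist(types),
             stringsAsFactors = FALSE)
}

# Made-up stand-ins for two parsed batches (real ones would come from entrez_fetch)
batch1 <- xmlParse('<PubmedArticleSet>
  <PubmedArticle>
    <PublicationTypeList>
      <PublicationType>Journal Article</PublicationType>
      <PublicationType>Meta-Analysis</PublicationType>
    </PublicationTypeList>
    <ArticleIdList><ArticleId IdType="pubmed">111</ArticleId></ArticleIdList>
  </PubmedArticle>
  <PubmedArticle>
    <PublicationTypeList>
      <PublicationType>Journal Article</PublicationType>
    </PublicationTypeList>
    <ArticleIdList><ArticleId IdType="pubmed">222</ArticleId></ArticleIdList>
  </PubmedArticle>
</PubmedArticleSet>', asText = TRUE)

batch2 <- xmlParse('<PubmedArticleSet>
  <PubmedArticle>
    <PublicationTypeList>
      <PublicationType>Review</PublicationType>
    </PublicationTypeList>
    <ArticleIdList><ArticleId IdType="pubmed">333</ArticleId></ArticleIdList>
  </PubmedArticle>
</PubmedArticleSet>', asText = TRUE)

fwd <- list(Batch1 = batch1, Batch2 = batch2)

# Apply the helper to every batch and stack the results into one data frame
result <- do.call(rbind, lapply(fwd, pubtypes_long))
rownames(result) <- NULL
print(result)
```

Because the repeat counts are computed inside each PubmedArticle node, the PMID and publication-type vectors always line up, no matter how many batches the list holds.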
    

    Also, NCBI says to use the HTTP POST method when sending more than 200 UIDs. rentrez does not support POSTing, but you can do it yourself with a few lines of code.

    First, you need a vector with thousands of PubMed IDs (here, 6171 from the microbial genome table).

    library(readr)
    x <- read_tsv( "ftp://ftp.ncbi.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt", 
                    na = "-", quote = "")
    ids <- unique( x$`Pubmed ID` )
    ids <- ids[ids < 1e9 & !is.na(ids)]
    

    Post the ids to NCBI using httr POST.

    uri <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?"
    response <- httr::POST(uri, body= list(id = paste(ids, collapse=","), db = "pubmed"))
    

    Parse the results following the code in entrez_post to get the web history.

     doc  <-   xmlParse( httr::content(response, as="text", encoding="UTF-8") )
     result <- xpathApply(doc, "/ePostResult/*", xmlValue)
     names(result) <- c("QueryKey", "WebEnv")
     class(result) <- c("web_history", "list")
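
The parsing step can be tried offline on a small sample. This is only a sketch: the response text below is made up to resemble an ePostResult, and `PLACEHOLDER_WEBENV` is not a real WebEnv value. It also takes the element names from the XML tags instead of assuming their order, which is a small variation on the code above.

```r
library(XML)

# Made-up response text shaped like an ePostResult (placeholder values)
sample_response <- '<ePostResult>
  <QueryKey>1</QueryKey>
  <WebEnv>PLACEHOLDER_WEBENV</WebEnv>
</ePostResult>'

doc <- xmlParse(sample_response, asText = TRUE)

# Each child element of ePostResult becomes one list entry
fields <- getNodeSet(doc, "/ePostResult/*")
result <- lapply(fields, xmlValue)

# Name the entries after their tags rather than relying on a fixed order
names(result) <- sapply(fields, xmlName)
class(result) <- c("web_history", "list")

result$QueryKey
result$WebEnv
```

Naming by tag means an unexpected element in the response shows up under its own name instead of being silently mislabeled.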
    

    Finally, fetch up to 10K records (or loop through using the retstart option if you have more than 10K)

    res <- entrez_fetch(db="pubmed", rettype="xml", web_history=result)
    doc <- xmlParse(res)
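
For more than 10K records, the retstart loop mentioned above might look roughly like this. The helper names `page_starts` and `fetch_all` are hypothetical, and `fetch_all` is defined but not run here because it needs network access and a real web_history object. It assumes retstart is zero-based and that entrez_fetch passes retstart and retmax through to the NCBI API.

```r
# Hypothetical helper: zero-based offsets for paging through `total` records
page_starts <- function(total, chunk_size = 10000) {
  if (total <= 0) return(integer(0))
  seq(0, total - 1, by = chunk_size)
}

# Hypothetical wrapper (not run here): one entrez_fetch call per page
fetch_all <- function(web_history, total, chunk_size = 10000) {
  lapply(page_starts(total, chunk_size), function(start) {
    rentrez::entrez_fetch(db = "pubmed", rettype = "xml",
                          web_history = web_history,
                          retstart = start, retmax = chunk_size)
  })
}

# Offsets that 25,000 records would need
page_starts(25000)
```

Each element returned by `fetch_all` would be one XML string, which could then be parsed with xmlParse and processed page by page.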
    

    The remaining steps take only about a second to run on my laptop.

    articles <- getNodeSet(doc, "//PubmedArticle")
    x1 <- sapply(articles, xpathSApply, ".//ArticleId[@IdType='pubmed']", xmlValue)
    x2 <- sapply(articles, xpathSApply, ".//PublicationType", xmlValue)
    
    # tibble() replaces the deprecated data_frame()
    tibble::tibble( pmid = rep(x1, sapply(x2, length) ), pubtype = unlist(x2) )
    # A tibble: 9,885 × 2
           pmid                                  pubtype
          <chr>                                    <chr>
     1 11677608                          Journal Article
     2 11677608         Research Support, Non-U.S. Gov't
     3 12950922                          Journal Article
     4 12950922         Research Support, Non-U.S. Gov't
     5 22328765                          Journal Article
     ...
    

    And one last comment. If you want one row per article, I usually create a function that combines multiple tags into a delimited list and adds NAs for missing nodes.

    xpath2 <-function(x, ...){
        y <- xpathSApply(x, ...)
        ifelse(length(y) == 0, NA,  paste(y, collapse="; "))
    }
    
    # tibble() replaces the deprecated data_frame()
    tibble::tibble( pmid = sapply(articles, xpath2, ".//ArticleId[@IdType='pubmed']", xmlValue),
                    journal = sapply(articles, xpath2, ".//Journal/Title", xmlValue),
                    pubtypes = sapply(articles, xpath2, ".//PublicationType", xmlValue))
    
    # A tibble: 6,172 × 3
          pmid                 journal                                          pubtypes
         <chr>                   <chr>                                             <chr>
    1 11677608                  Nature Journal Article; Research Support, Non-U.S. Gov't
    2 12950922  Molecular microbiology Journal Article; Research Support, Non-U.S. Gov't
    3 22328765 Journal of bacteriology Journal Article; Research Support, Non-U.S. Gov't
    4 11337471         Genome research                                   Journal Article
    ...
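
To see the NA handling without querying NCBI, here is a small offline sketch. The XML is a made-up, simplified stand-in for a PubMed result in which the second article has no Journal/Title node, and this version of `xpath2` uses `if`/`else` with `NA_character_` in place of `ifelse`, which is a stylistic variation and not the original function.

```r
library(XML)

# Collapse all matches into one delimited string, or NA if nothing matches
xpath2 <- function(x, ...) {
  y <- xpathSApply(x, ...)
  if (length(y) == 0) NA_character_ else paste(y, collapse = "; ")
}

# Made-up, simplified document: the second article lacks a journal title
doc <- xmlParse('<PubmedArticleSet>
  <PubmedArticle>
    <Journal><Title>Nature</Title></Journal>
    <PublicationType>Journal Article</PublicationType>
    <PublicationType>Review</PublicationType>
  </PubmedArticle>
  <PubmedArticle>
    <PublicationType>Journal Article</PublicationType>
  </PubmedArticle>
</PubmedArticleSet>', asText = TRUE)

articles <- getNodeSet(doc, "//PubmedArticle")

# One row per article; missing nodes become NA instead of shortening a column
one_row <- data.frame(
  journal  = sapply(articles, xpath2, ".//Journal/Title", xmlValue),
  pubtypes = sapply(articles, xpath2, ".//PublicationType", xmlValue),
  stringsAsFactors = FALSE
)
print(one_row)
```

Returning exactly one value per article is what keeps every column the same length, so the data frame can be built even when some tags are absent.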