
How to access values of sub-nodes (child) with different names in XML file?


I am trying to extract the xmlValue of certain child nodes from an NCBI XML file. For some PMIDs, the root node <PubmedArticleSet> contains two different kinds of PubMed records, PubmedBookArticle and PubmedArticle. I would like to apply a condition: if xmlName(fetch.pubmed) == "PubmedBookArticle", extract certain values; else if xmlName(fetch.pubmed) == "PubmedArticle", extract other values. Finally, I want to build a data frame holding both sets of values alongside their PMIDs. It seems simple, but xmlName(fetch.pubmed) throws the error:

no applicable method for 'xmlName' applied to an object of class "c('XMLInternalDocument', 'XMLAbstractDocument')"

Any help is appreciated, thank you.

<?xml version="1.0"?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2015//EN" "http://www.ncbi.nlm.nih.gov/corehtml/query/DTD/pubmed_150101.dtd">
<PubmedArticleSet>
  <PubmedBookArticle>
    <BookDocument>
      <PMID Version="1">25506969</PMID>
      <ArticleIdList>
        <ArticleId IdType="bookaccession">NBK259188</ArticleId>
      </ArticleIdList> ....

   ...... </BookDocument>
  </PubmedBookArticle>

  <PubmedArticle>
    <MedlineCitation Status="Publisher" Owner="NLM">
      <PMID Version="1">25013473</PMID>
      <DateCreated>
        <Year>2014</Year>
        <Month>7</Month>
        <Day>11</Day>
      </DateCreated>....

    ....</MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>

My code is below:

library(XML)
library(rentrez)

PM.ID <- c("25506969", "25032371", "24983039", "24983034", "24983032", "24983031",
           "26386083", "26273372", "26066373", "25837167",
           "25466451", "25013473")
# rentrez function to retrieve the XML file for the above PMIDs
fetch.pubmed <- entrez_fetch(db = "pubmed", id = PM.ID,
                             rettype = "xml", parsed = TRUE)
# Return NA when the XPath x1child matches nothing under node x
FindNull <- function(x,x1child){
  res <- xpathSApply(x,x1child,xmlValue)
  if (length(res) == 0){
    out <- NA
  }else {
    out <- res
  }
  out
}

# Extract contents from the XML file
xpathSApply(fetch.pubmed, "//PubmedArticle", FindNull, x1child = ".//ArticleTitle")

xpathSApply(fetch.pubmed, "//PubmedBookArticle", FindNull, x1child = ".//BookTitle")

How do I put the above code in a loop, so that I can retrieve values from PubmedArticle and PubmedBookArticle as and when the condition is met for each record?


Solution

  • There are a few ways to do this, but I would get separate node sets for books and articles. First, check which record types the result contains:

    table( xpathSApply(fetch.pubmed, "/PubmedArticleSet/*", xmlName) )
        PubmedArticle PubmedBookArticle 
                    6                 6 
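
The xmlName call above works because it is applied to each node under the root, whereas xmlName(fetch.pubmed) in the question was applied to the whole document. As a rough sketch of the per-record condition the question asks for, the example below branches on xmlName for each record and builds one data frame. It assumes a small inline sample in place of the entrez_fetch() result: doc, first_value and the article title are illustrative placeholders, not part of the original code or of the XML package.

```r
library(XML)

# Simplified stand-in for the entrez_fetch() result.
# The article title is a placeholder, not the real title for that PMID.
xml_text <- '<PubmedArticleSet>
  <PubmedBookArticle>
    <BookDocument>
      <PMID Version="1">25506969</PMID>
      <Book><BookTitle>Probe Reports from the NIH Molecular Libraries Program</BookTitle></Book>
    </BookDocument>
  </PubmedBookArticle>
  <PubmedArticle>
    <MedlineCitation Status="Publisher" Owner="NLM">
      <PMID Version="1">25013473</PMID>
      <Article><ArticleTitle>Placeholder article title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>'
doc <- xmlParse(xml_text, asText = TRUE)

# xmlName() needs a node, so start from the root node rather than the document.
root <- xmlRoot(doc)
xmlName(root)  # "PubmedArticleSet"

# Hypothetical helper: first match of `path` under `node`, or NA if there is none.
first_value <- function(node, path) {
  res <- xpathSApply(node, path, xmlValue)
  if (length(res) == 0) NA_character_ else res[[1]]
}

# One element node per record, whatever its type.
records <- getNodeSet(doc, "/PubmedArticleSet/*")

rows <- lapply(records, function(rec) {
  type <- xmlName(rec)
  # Choose the title element according to the record type.
  title <- switch(type,
    PubmedBookArticle = first_value(rec, ".//BookTitle"),
    PubmedArticle     = first_value(rec, ".//ArticleTitle"),
    NA_character_)  # any other record type gets no title
  data.frame(pmid = first_value(rec, ".//PMID"),
             type = type,
             title = title,
             stringsAsFactors = FALSE)
})

result <- do.call(rbind, rows)
print(result)
```

To try this on the real result, doc would be replaced by fetch.pubmed. Taking only the first .//PMID match is a deliberate choice in this sketch: it assumes the record's own PMID comes first in document order, since a record may contain other PMID elements further down.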
    
    books <- getNodeSet(fetch.pubmed, "/PubmedArticleSet/PubmedBookArticle")
    
    data.frame( pmid = sapply(books, function(x) xpathSApply(x, ".//PMID", xmlValue)),
                title = sapply(books, function(x) xpathSApply(x, ".//BookTitle", xmlValue))
    )
    
          pmid                                                                                                      title
    1 25506969                                                     Probe Reports from the NIH Molecular Libraries Program
    2 25032371                                                       Understanding Climate’s Influence on Human Evolution
    3 24983039 Assessing the Effects of the Gulf of Mexico Oil Spill on Human Health: A Summary of the June 2010 Workshop
    4 24983034                                                  In the Light of Evolution: Volume IV: The Human Condition
    5 24983032                                            The Role of Human Factors in Home Health Care: Workshop Summary