Tags: xml, r, xml-parsing, pubmed

R: xmlParse incorrectly "adding" extra " AND " to URL link, parsing fails


I am attempting to parse XML output from the NIH's PubMed system. I have already generated the URLs to parse, but the xmlParse() function appears to insert extra " AND " text into any URL that contains operators.

For example:

url <- 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=smith+m[author]+AND+science[journal]'
di <- xmlParse(url)
dl <- xmlToList(di)

This results in a NULL IdList (where the results should be):

> dl[["IdList"]]
NULL

Checking the QueryTranslation reveals the problem (note the extra AND):

> dl[["QueryTranslation"]]
[1] "smith+m[author] AND +AND+science[journal]"

Any idea what's going on there? This happens with every query I construct that contains an operator such as "AND" or "OR", regardless of the search field or query type.

For reference, a clean parse that finds 20 papers:

> url <- 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=smith+bm[author]'
> di <- xmlParse(url)
> dl <- xmlToList(di)
> length(dl[["IdList"]])
[1] 20

Solution

  • Assuming you want to do this from scratch rather than with an existing package:

    Use httr to retrieve the payload first, since it doesn't mess up the URL, then pass the response text to xmlParse():

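    library("XML")
    library("httr")
    url <- 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=smith+m[author]+AND+science[journal]'
    # Fetch the response with httr instead of letting xmlParse() open the URL itself
    res <- GET(url)
    # Parse the response body, passed in as text
    di <- xmlParse(content(res, "text"))
    dl <- xmlToList(di)
    # Flatten IdList into a plain character vector of PubMed IDs
    unname(unlist(dl[["IdList"]]))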
    
    [1] "25745065" "25430773" "25395526" "25104368" "24458648" "24264993" "24052300" "23869013"
    [9] "23363771" "22936773" "22116878" "21940895" "21330515" "21097923" "20966241" "20150469"
    [17] "19407144" "19150811" "19119232" "19119226"
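
A related precaution, if you build these URLs by hand, is to percent-encode the search term so that spaces and brackets reach the server unambiguously. The sketch below uses base R's URLencode(). The helper name build_esearch_url and its arguments are hypothetical (not part of any package), and I'm assuming E-utilities treats the percent-encoded term the same as the "+"-separated one, so check the QueryTranslation on a live query before relying on it.

```r
# Hypothetical helper (not from any package): build an esearch URL with a
# percent-encoded search term, using only base R.
build_esearch_url <- function(term, db = "pubmed") {
  base <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
  # reserved = TRUE also encodes spaces, square brackets, and other
  # reserved characters, leaving only letters, digits, and . _ ~ - as-is
  encoded_term <- utils::URLencode(term, reserved = TRUE)
  paste0(base, "?db=", db, "&term=", encoded_term)
}

# Write the query with ordinary spaces; the helper handles the encoding
url <- build_esearch_url("smith m[author] AND science[journal]")
print(url)
```

The resulting url can be handed to GET() exactly as in the code above.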