I am attempting to parse XML output from the NIH's PubMed system. I have already generated the URLs to parse, but the xmlParse() function appears to insert an extra " AND " into URLs that contain operators.
For example:
url <- 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=smith+m[author]+AND+science[journal]'
di <- xmlParse(url)
dl <- xmlToList(di)
This results in a NULL IdList (where the results should be):
> dl[["IdList"]]
NULL
Checking the QueryTranslation reveals the problem (see: extra AND):
> dl[["QueryTranslation"]]
[1] "smith+m[author] AND +AND+science[journal]"
Any idea what's going on there? This occurs with every search field or query type I construct that includes an operator such as "AND" or "OR".
A clean parse that finds 20 papers for reference:
> url <- 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=smith+bm[author]'
> di <- xmlParse(url)
> dl <- xmlToList(di)
> length(dl[["IdList"]])
[1] 20
Assuming you want to do this from scratch instead of a package I mentioned above:
Use httr first to retrieve the payload, which doesn't mess up the URL:
library("XML")
library("httr")
url <- 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=smith+m[author]+AND+science[journal]'
res <- GET(url)
di <- xmlParse(content(res, "text"))
dl <- xmlToList(di)
unname(unlist(dl[["IdList"]]))
[1] "25745065" "25430773" "25395526" "25104368" "24458648" "24264993" "24052300" "23869013"
[9] "23363771" "22936773" "22116878" "21940895" "21330515" "21097923" "20966241" "20150469"
[17] "19407144" "19150811" "19119232" "19119226"
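If you are generating many of these URLs, it may also be worth percent-encoding the search term rather than hand-writing the `+` separators. The following is only a sketch: `build_esearch_url` is a hypothetical helper name, not part of any package, and I have not checked whether encoding alone changes what xmlParse() does when it is handed a URL directly, so fetching with GET() as above remains the path I actually tested.

```r
# Hypothetical helper (an assumption for illustration, not from any package):
# build an esearch URL from a human-readable search term.
build_esearch_url <- function(term, db = "pubmed") {
  base <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
  # reserved = TRUE percent-encodes spaces, square brackets and other
  # reserved characters in the term
  encoded_term <- utils::URLencode(term, reserved = TRUE)
  paste0(base, "?db=", db, "&term=", encoded_term)
}

url <- build_esearch_url("smith m[author] AND science[journal]")
print(url)
```

The resulting string can be passed to GET(url) exactly like the hand-written URL above.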