I have a list of titles of academic papers that I need to download. I would like to write a loop that downloads their PDF files from the web, but I can't find a way to do it.
Here is the step-by-step plan I have so far (an answer in either R or Python is welcome):
# Create list with paper titles (example with 4 papers from different journals)
titles <- c("Effect of interfacial properties on polymer–nanocrystal thermoelectric transport",
"Reducing social and environmental impacts of urban freight transport: A review of some major cities",
"Using Lorenz curves to assess public transport equity",
"Green infrastructure: The effects of urban rail transit on air quality")
# Loop step 1 - Query the paper title in Google Scholar to get the URL of the journal webpage containing the paper
# Loop step 2 - Download the PDF from the journal webpage and save it on your computer
for (i in titles){
  journal_URL <- "query i in Google (Scholar)"   # pseudocode for step 1: take the first hit
  download.file(url = journal_URL,               # step 2: save the PDF locally
                destfile = paste0(i, ".pdf"))
}
Complicators:
Loop step 1 - The first hit on Google Scholar should be the paper's original URL. However, I've heard Google Scholar is a bit fussy with bots, so the alternative would be to query plain Google and take the first URL (hoping it points to the correct page).
Loop step 2 - Some papers are gated, so I imagine it would be necessary to include authentication info (user = __, passwd = __). If I am using my university network, though, this authentication should be automatic, right? A rough sketch of what I mean for this step is below.
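To make loop step 2 concrete, here is a rough sketch of just the download part, assuming journal_URL already points straight at a PDF (a hypothetical placeholder URL; the query in step 1 is the part I can't solve):
# Sketch of loop step 2 only; journal_URL is a hypothetical placeholder until step 1 is solved
for (i in titles){
  journal_URL <- "https://example.com/paper.pdf"        # placeholder: would come from the step-1 query
  fname <- paste0(gsub("[^A-Za-z0-9]+", "_", i), ".pdf") # filesystem-safe file name from the title
  # mode = "wb" keeps the PDF binary intact on Windows; try() skips papers that fail (e.g. gated)
  try(download.file(url = journal_URL, destfile = fname, mode = "wb"))
}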
P.S. I only need to download the PDFs; I'm not interested in bibliometric information (e.g. citation records, h-index). For bibliometric data there is some guidance here (R users) and here (Python users).
Crossref has a program through which publishers can deposit metadata links to full-text versions of articles. Unfortunately, for publishers like Wiley, Elsevier, and Springer, they may provide the links, but you then need extra permissions to actually retrieve the content. Fun, right? Anyway, some do work. For example, this works for your second title: search Crossref, fetch the full-text URLs if provided, then grab the XML (better than PDF, IMHO):
titles <- c("Effect of interfacial properties on polymer–nanocrystal thermoelectric transport",
            "Reducing social and environmental impacts of urban freight transport: A review of some major cities",
            "Using Lorenz curves to assess public transport equity",
            "Green infrastructure: The effects of urban rail transit on air quality")
library("rcrossref")
out <- cr_search(titles[2])                        # search Crossref for the title
doi <- sub("http://dx.doi.org/", "", out$doi[1])   # strip the prefix to get the bare DOI
(links <- cr_ft_links(doi, "all"))                 # ask Crossref for full-text links
$xml
<url> http://api.elsevier.com/content/article/PII:S1877042812005551?httpAccept=text/xml
$plain
<url> http://api.elsevier.com/content/article/PII:S1877042812005551?httpAccept=text/plain
xml <- cr_ft_text(links, "xml")        # fetch the XML full text
library("XML")
xpathApply(xml, "//ce:author")[[1]]    # e.g. pull out the first author node
<ce:author>
<ce:degrees>Prof</ce:degrees>
<ce:given-name>Eiichi</ce:given-name>
<ce:surname>Taniguchi</ce:surname>
</ce:author>
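Putting it together, here is a rough sketch of a loop over all of your titles, with the caveat that it only does something when Crossref returns a hit and the publisher actually exposes a full-text link (many won't without extra permissions); the tryCatch/next guards are my assumption about how you'd want failures handled:
library("rcrossref")
library("XML")

for (ttl in titles) {
  out <- cr_search(ttl)
  if (is.null(out) || nrow(out) == 0 || is.na(out$doi[1])) next    # no Crossref hit
  doi <- sub("http://dx.doi.org/", "", out$doi[1])
  links <- tryCatch(cr_ft_links(doi, "all"), error = function(e) NULL)
  if (is.null(links) || is.null(links$xml)) next                   # no full-text link deposited
  xml <- tryCatch(cr_ft_text(links, "xml"), error = function(e) NULL)
  if (is.null(xml)) next                                           # gated or fetch failed
  # write the XML full text to disk, one file per title
  saveXML(xml, file = paste0(gsub("[^A-Za-z0-9]+", "_", ttl), ".xml"))
}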