I had a little script I was very pleased with: it would read one or more bibliographic references from the clipboard, fetch information on each paper from Google Scholar, and then feed the result into Sci-Hub to download the PDF. For some reason it has stopped working, and I have spent ages trying to work out why.
Testing reveals that the Google Scholar (scholarly.py) part of the program is working correctly; it's the Sci-Hub part that is the issue.
Any ideas?
Here is an example reference: Appleyard, S.J., Angeloni, J. and Watkins, R. (2006) Arsenic-rich groundwater in an urban area experiencing drought and increasing population density, Perth, Australia. Applied Geochemistry 21(1), 83-97.
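To narrow it down, the Sci-Hub call can be exercised on its own with a minimal sketch like the one below (the 'request' form field matches the script that follows; the article URL is just a placeholder):

import urllib
import urllib2

site = 'http://sci-hub.cc/'
# Placeholder article URL; substitute one returned by Google Scholar.
data = urllib.urlencode({'request': 'http://example.com/some-paper'})
try:
    res = urllib2.urlopen(site, data)
    print res.getcode()
except urllib2.HTTPError as e:
    # HTTPError carries the server's HTTP status code (e.g. 403).
    print 'HTTP error:', e.code
except urllib2.URLError as e:
    # URLError covers DNS and connection-level failures.
    print 'URL error:', e.reason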
'''Program to automatically find and download items from a bibliography or
references list. This program uses the 'scihub' website to obtain the
full-text paper where available; if no entry is found, the paper is skipped
and the failed downloads are listed at the end.'''
import scholarly
import win32clipboard
import urllib
import urllib2
import webbrowser
import re
'''Select and copy the bibliography entries you want to download papers for;
Python reads the clipboard.'''
win32clipboard.OpenClipboard()
c = win32clipboard.GetClipboardData()
win32clipboard.EmptyClipboard()
'''Clean up the text: remove line endings, double spaces, etc.'''
c = c.replace('\n', ' ')
c = c.replace('\r', ' ')
while c.find('  ') != -1:
    c = c.replace('  ', ' ')
win32clipboard.SetClipboardText(c)
win32clipboard.CloseClipboard()
print "Working..."
'''A bit of regex to extract the title of the paper.
IMPORTANT: the bibliography has to be in author-date format or you will need
to revise this. At the moment it looks for a year in brackets, then copies
all the text until it reaches a full stop, assuming that this is the paper
title. If it is not, it will either fail or will search on inappropriate
terms.'''
paper_info = re.findall(r"(\d{4}[a-z]*)([). ]+)([ \"])+([\w\s_():,-]*)(.)", c)
print "Analysing titles"
print "The following titles found:"
print "*************************"
list_of_titles = list()
for i in paper_info:
    print '%s...' % (i[3][:50])
    paper_title = str(i[3])
    list_of_titles.append(paper_title)
failed = list()
for title in list_of_titles:
    try:
        search_query = scholarly.search_pubs_query(title)
        info = next(search_query)
        print "Querying Google Scholar"
        print "**********************"
        print "Looking up paper title:"
        print "**********************"
        print title
        print "**********************"
        url = info.bib['url']
        print "Journal URL found"
        print url
        #url=next(search_query)
        print "Sending URL: ", url
        site = 'http://sci-hub.cc/'
        data = urllib.urlencode({'request': url})
        print data
        results = urllib2.urlopen(site, data)  # this is where it fails
        with open("results.html", "w") as f:
            f.write(results.read())
        webbrowser.open_new("results.html")
    except:
        print "**********************"
        print "No valid journal found for:"
        print title
        print "**********************"
        print "Continuing..."
        failed.append(title)
        continue
if len(failed) == 0:
    print 'Complete'
else:
    print '*************************************'
    print 'The following titles did not download: '
    print '*************************************'
    print failed
    print "Please check that these are valid entries"
This works now. I added a "User-Agent" header and re-jigged the urllib calls, and it seems more obvious what the code is doing now. It was a process of trial and error, trying lots of different snippets of code picked up from around the web. I hope my boss doesn't ask me what I've achieved today. Someone should create a forum where people can get answers to coding problems...
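Stripped down, the change that made the difference is building a urllib2.Request and setting the header before posting (a minimal sketch; the User-Agent string is abbreviated here and the article URL is a placeholder):

import urllib
import urllib2

site = 'http://sci-hub.cc/'
r = urllib2.Request(url=site)
# Identify as a browser; adding this header is what got the POST working for me.
r.add_header('User-Agent', 'Mozilla/5.0')
# Placeholder article URL; in the full script it comes from Google Scholar.
r.add_data(urllib.urlencode({'request': 'http://example.com/some-paper'}))
res = urllib2.urlopen(r)
print res.getcode()

The full, working script is below.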
'''Program to automatically find and download items from a bibliography or
references list. Below are some journal papers in bibliographic format; just
copy the text to the clipboard and run the script.
Ghaffour, N., T. M. Missimer and G. L. Amy (2013). "Technical review and evaluation of the economics of water desalination: Current and future challenges for better water supply sustainability." Desalination 309(0): 197-207.
Gutiérrez Ortiz, F. J., P. G. Aguilera and P. Ollero (2014). "Biogas desulfurization by adsorption on thermally treated sewage-sludge." Separation and Purification Technology 123(0): 200-213.
This program uses the 'scihub' website to obtain the full-text paper where
available; if no entry is found, the paper is skipped and the failed downloads
are listed at the end.'''
import scholarly
import win32clipboard
import urllib
import urllib2
import webbrowser
import re
'''Select and copy the bibliography entries you want to download papers for;
Python reads the clipboard.'''
win32clipboard.OpenClipboard()
c = win32clipboard.GetClipboardData()
win32clipboard.EmptyClipboard()
'''Clean up the text: remove line endings, double spaces, etc.'''
c = c.replace('\n', ' ')
c = c.replace('\r', ' ')
while c.find('  ') != -1:
    c = c.replace('  ', ' ')
win32clipboard.SetClipboardText(c)
win32clipboard.CloseClipboard()
print "Working..."
'''A bit of regex to extract the title of the paper.
IMPORTANT: the bibliography has to be in author-date format or you will need
to revise this. At the moment it looks for a year in brackets, then copies
all the text until it reaches a full stop, assuming that this is the paper
title. If it is not, it will either fail or will search on inappropriate
terms.'''
paper_info = re.findall(r"(\d{4}[a-z]*)([). ]+)([ \"])+([\w\s_():,-]*)(.)", c)
print "Analysing titles"
print "The following titles found:"
print "*************************"
list_of_titles = list()
for i in paper_info:
    print '%s...' % (i[3][:50])
    paper_title = str(i[3])
    list_of_titles.append(paper_title)
paper_number = 0
failed = list()
for title in list_of_titles:
    try:
        search_query = scholarly.search_pubs_query(title)
        info = next(search_query)
        paper_number += 1
        print "Querying Google Scholar"
        print "**********************"
        print "Looking up paper title:"
        print title
        print "**********************"
        url = info.bib['url']
        print "Journal URL found"
        print url
        #url=next(search_query)
        print "Sending URL: ", url
        site = 'http://sci-hub.cc/'
        r = urllib2.Request(url=site)
        r.add_header('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')
        r.add_data(urllib.urlencode({'request': url}))
        res = urllib2.urlopen(r)
        with open("results.html", "w") as f:
            f.write(res.read())
        webbrowser.open_new("results.html")
        if paper_number < len(list_of_titles):
            print "Next title"
    except Exception as e:
        print repr(e)
        paper_number += 1
        print "**********************"
        print "No valid journal found for:"
        print title
        print "**********************"
        print "Continuing..."
        failed.append(title)
        continue
if len(failed) == 0:
    print 'Complete'
else:
    print '*************************************'
    print 'The following titles did not download: '
    print '*************************************'
    print failed
    print "Please check that these are valid entries"