I am trying to use the pdfmeat script to get data about papers from Google Scholar.
The script works very well on my PC, but when I run it on my server I get no results. It looks very likely that my server is on Google Scholar's blacklist, given that I get an error (a redirect to a page asking me to solve a CAPTCHA):
$ wget scholar.google.com
--2011-08-08 04:52:19-- http://scholar.google.com/
Resolving scholar.google.com...,,, ...
Connecting to scholar.google.com||:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://www.google.com/sorry/?continue=http://scholar.google.com/ [following]
--2011-08-08 04:52:24-- http://www.google.com/sorry/?continue=http://scholar.google.com/
Resolving www.google.com...,,, ...
Connecting to www.google.com||:80... connected.
HTTP request sent, awaiting response... 503 Service Unavailable
2011-08-08 04:52:24 ERROR 503: Service Unavailable.
Then I found that wget has the option --execute "http_proxy=urltoproxy". I ran:
wget -e "http_proxy=oneHttpProxy" scholar.google.com
and I was able to save index.html from Google Scholar.
Then I tried the same with pdfmeat.py, but I don't get any results either.
This is the code:
    def getWebdata(self, link, referer='http://scholar.google.com'):
        useragent = 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv: Gecko/20100214 Ubuntu/9.10 (karmic) Firefox/3.5.8'
        c_web = 'wget --execute "http_proxy=oneHttpProxy" -qO- --user-agent="%s" --load-cookies="%s" "%s" --referer="%s"' % (useragent, WGET_COOKIEFILE, link, referer)
        c_out = os.popen(c_web)
        c_txt = c_out.read()
        if re.search("We're sorry", c_txt) or re.search("please type the characters", c_txt):
            self.logger.critical("scholar captcha")
            if not self.options.quiet:
                print "PDFMEAT: scholar captcha!"
        self.logger.debug("getwebdata excerpt: %s" % (re.sub("\n", " ", c_txt[0:255])))
        self.queryLog.append("getwebdata excerpt: %s" % (re.sub("\n", " ", c_txt[0:255])))
        return c_txt
The script uses the os module. The original function calls wget without the --execute option; I added it myself.
Thanks in advance
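Have you tried just setting the http_proxy environment variable before running the script? wget reads it, so the function would not need the --execute option at all.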
$ export http_proxy="oneHttpProxy"
$ python pdfmeat.py ....
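
If exporting the variable in the shell is not convenient (for example, if the script is started by cron on the server), the same effect can probably be had from inside the script, because processes started with os.popen inherit whatever is in os.environ. Below is a minimal sketch of that idea, not part of pdfmeat: PROXY_URL and child_sees_proxy are made-up names, the proxy address is a placeholder, and a child Python process stands in for wget so the example runs without network access.

```python
import os
import subprocess
import sys

# Placeholder proxy address (assumption); replace it with the real proxy URL.
PROXY_URL = "http://proxy.example.com:8080"


def child_sees_proxy(proxy_url):
    """Set http_proxy in this process and return the value a child process sees."""
    # Equivalent of `export http_proxy=...`, but done from inside Python.
    os.environ["http_proxy"] = proxy_url
    # The child process inherits the modified environment, just as a wget
    # started through os.popen would.
    out = subprocess.check_output(
        [sys.executable, "-c",
         "import os; print(os.environ.get('http_proxy', ''))"]
    )
    return out.decode().strip()


if __name__ == "__main__":
    print(child_sees_proxy(PROXY_URL))
```

In pdfmeat.py this would come down to one os.environ["http_proxy"] = ... assignment somewhere before getWebdata is first called.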