So the following code works perfectly when I run it on my local machine, both in PyCharm and from a shell script:
# -*- coding: utf-8 -*-
import requests
from lxml import etree, html
import chardet
def gimme_pairs():
    url = "https://halbidoncom/sha.xml"
    page = requests.get(url).content
    encoding = chardet.detect(page)['encoding']
    if encoding != 'utf-8':
        page = page.decode(encoding, 'replace').encode('utf-8')
    doc = html.fromstring(page, base_url=url)
    print(doc)
    print(page)
    wanted = doc.xpath('//location')
    print(wanted)
    date_list = None
    tashkif_list = None
    for elem in wanted:
        date_list = elem.xpath('locationdata/timeunitdata/date/text()')
        tashkif_list = elem.xpath('locationdata/timeunitdata/element/elementvalue/text()')
But on PythonAnywhere I get this output for doc:
b'\n\n\nChallenge=355121;\nChallengeId=58551073;\nGenericErrorMessageCookies="Cookies must be enabled in order to view this page.";\n\n\nfunction test(var1)\n{\n\tvar var_str=""+Challenge;\n\tvar var_arr=var_str.split("");\n\tvar LastDig=var_arr.reverse()[0];\n\tvar minDig=var_arr.sort()[0];\n\tvar subvar1 = (2 * (var_arr[2]))+(var_arr[1]*1);\n\tvar subvar2 = (2 * var_arr[2])+var_arr[1];\n\tvar my_pow=Math.pow(((var_arr[0]*1)+2),var_arr[1]);\n\tvar x=(var1*3+subvar1)*1;\n\tvar y=Math.cos(Math.PI*subvar2);\n\tvar answer=x*y;\n\tanswer-=my_pow*1;\n\tanswer+=(minDig*1)-(LastDig*1);\n\tanswer=answer+subvar2;\n\treturn answer;\n}\n\n\nclient = null;\nif (window.XMLHttpRequest)\n{\n\tvar client=new XMLHttpRequest();\n}\nelse\n{\n\tif (window.ActiveXObject)\n\t{\n\t\tclient = new ActiveXObject(\'MSXML2.XMLHTTP.3.0\');\n\t};\n}\nif (!((!!client)&&(!!Math.pow)&&(!!Math.cos)&&(!![].sort)&&(!![].reverse)))\n{\n\tdocument.write("Not all needed JavaScript methods are supported.");\n\n}\nelse\n{\n\tclient.onreadystatechange = function()\n\t{\n\t\tif(client.readyState == 4)\n\t\t{\n\t\t\tvar MyCookie=client.getResponseHeader("X-AA-Cookie-Value");\n\t\t\tif ((MyCookie == null) || (MyCookie==""))\n\t\t\t{\n\t\t\t\tdocument.write(client.responseText);\n\t\t\t\treturn;\n\t\t\t}\n\t\t\t\n\t\t\tvar cookieName = MyCookie.split(\'=\')[0];\n\t\t\tif (document.cookie.indexOf(cookieName)==-1)\n\t\t\t{\n\t\t\t\tdocument.write(GenericErrorMessageCookies);\n\t\t\t\treturn;\n\t\t\t}\n\t\t\twindow.location.reload(true);\n\t\t}\n\t};\n\ty=test(Challenge);\n\tclient.open("POST",window.location,true);\n\tclient.setRequestHeader(\'X-AA-Challenge-ID\', ChallengeId);\n\tclient.setRequestHeader(\'X-AA-Challenge-Result\',y);\n\tclient.setRequestHeader(\'X-AA-Challenge\',Challenge);\n\tclient.setRequestHeader(\'Content-Type\' , \'text/plain\');\n\tclient.send();\n}\n\n\n\nJavaScript must be enabled in order to view this page.\n\n'
Things I've tried:
What gives? What strikes me is that requests is supposed to behave the same way on both my machine and theirs.
Looks like the server you're trying to scrape has protection that tries to make sure there's a real browser (and a human) behind the request. If you format that response nicely, you'll see it's a JavaScript page that takes the Challenge and ChallengeId values defined at the beginning, computes an answer from them, and POSTs the result back in X-AA-Challenge request headers before reloading the page.
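To make that concrete, here is a minimal sketch of spotting the challenge page and pulling out the two values with a regular expression. The function name parse_challenge and the second test input are made up for illustration; the sample bytes are just the first lines of the response from the question.

```python
import re

# Match the two assignments at the top of the challenge page,
# e.g. b"Challenge=355121;" and b"ChallengeId=58551073;".
CHALLENGE_RE = re.compile(rb"\bChallenge=(\d+);")
CHALLENGE_ID_RE = re.compile(rb"\bChallengeId=(\d+);")


def parse_challenge(body):
    """Return (challenge, challenge_id) as strings, or None for a normal page.

    parse_challenge is a hypothetical helper, not part of requests.
    """
    challenge = CHALLENGE_RE.search(body)
    challenge_id = CHALLENGE_ID_RE.search(body)
    if challenge is None or challenge_id is None:
        return None
    return (challenge.group(1).decode("ascii"),
            challenge_id.group(1).decode("ascii"))


# First lines of the response shown in the question.
sample = (b'\n\n\nChallenge=355121;\nChallengeId=58551073;\n'
          b'GenericErrorMessageCookies="Cookies must be enabled '
          b'in order to view this page.";\n')

print(parse_challenge(sample))
# A made-up stand-in for the XML you actually want: no challenge found.
print(parse_challenge(b"<root><location/></root>"))
```

Calling something like this on requests.get(url).content before handing the bytes to lxml would at least tell you whether you got the real XML or the challenge page.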
I assume the server owners have added the IPs/servers that PythonAnywhere uses to a list of addresses whose requests get blocked (maybe someone really spammed them from there in the past?).
Having a look around for the same headers, I found this project, which seems to have solved the same problem: https://github.com/niryariv/opentaba-server/
They check for the challenge here: https://github.com/niryariv/opentaba-server/blob/master/lib/mavat_scrape.py#L31 and parse it with this helper: https://github.com/niryariv/opentaba-server/blob/master/lib/helpers.py#L109
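For reference, a rough Python port of the test() function from the challenge page might look like the following. It's a sketch, not the opentaba code: solve_challenge and challenge_headers are hypothetical names, it reads the x and y lines of test() as (var1*3+subvar1)*1 and Math.cos(Math.PI*subvar2), which is an assumption about the page's script, and I haven't verified it against the live server. The header names are the ones the page's own script sets.

```python
import math


def solve_challenge(challenge):
    """Port of the page's JavaScript test(); `challenge` is the digit string."""
    digits = list(challenge)
    last_digit = int(digits[-1])   # JS: var_arr.reverse()[0]
    # reverse() and sort() both work in place in JS, so every later
    # var_arr[i] lookup sees the sorted array.
    digits.sort()
    min_digit = int(digits[0])     # JS: var_arr.sort()[0]

    subvar1 = 2 * int(digits[2]) + int(digits[1])
    # In JS, number + string concatenates, so subvar2 is a string.
    subvar2 = str(2 * int(digits[2])) + digits[1]
    my_pow = (int(digits[0]) + 2) ** int(digits[1])

    x = int(challenge) * 3 + subvar1
    # cos(pi * n) is +1 or -1 for an integer n; round() removes float noise.
    y = round(math.cos(math.pi * int(subvar2)))

    answer = x * y - my_pow + min_digit - last_digit
    # The final JS "+" is again string concatenation.
    return str(answer) + subvar2


def challenge_headers(challenge, challenge_id):
    """Request headers that the page's script sets before POSTing back."""
    return {
        "X-AA-Challenge-ID": challenge_id,
        "X-AA-Challenge-Result": solve_challenge(challenge),
        "X-AA-Challenge": challenge,
        "Content-Type": "text/plain",
    }


print(challenge_headers("355121", "58551073"))
```

Judging by the script, the next step would be to POST those headers to the same URL, read the cookie from the X-AA-Cookie-Value response header, and retry the GET with that cookie set, since that is what the page does before it reloads.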
Hope that helps!