
python requests weird error on PythonAnywhere


The following code works perfectly when I run it on my local machine, either in PyCharm or from a shell script:

# -*- coding: utf-8 -*-

import requests
from lxml import etree, html
import chardet

def gimme_pairs():

    url = "https://halbidoncom/sha.xml"
    page = requests.get(url).content
    encoding = chardet.detect(page)['encoding']

    if encoding != 'utf-8':
        page = page.decode(encoding, 'replace').encode('utf-8')

    doc = html.fromstring(page, base_url=url)
    print(doc)
    print(page)
    wanted = doc.xpath('//location')

    print(wanted)

    date_list = None
    tashkif_list = None

    for elem in wanted:
        date_list = elem.xpath('locationdata/timeunitdata/date/text()')
        tashkif_list = elem.xpath('locationdata/timeunitdata/element/elementvalue/text()')

But on PythonAnywhere, this is what print(page) outputs:

b'\n\n\nChallenge=355121;\nChallengeId=58551073;\nGenericErrorMessageCookies="Cookies must be enabled in order to view this page.";\n\n\nfunction test(var1)\n{\n\tvar var_str=""+Challenge;\n\tvar var_arr=var_str.split("");\n\tvar LastDig=var_arr.reverse()[0];\n\tvar minDig=var_arr.sort()[0];\n\tvar subvar1 = (2 * (var_arr[2]))+(var_arr[1]*1);\n\tvar subvar2 = (2 * var_arr[2])+var_arr[1];\n\tvar my_pow=Math.pow(((var_arr[0]*1)+2),var_arr[1]);\n\tvar x=(var1*3+subvar1)*1;\n\tvar y=Math.cos(Math.PI*subvar2);\n\tvar answer=x*y;\n\tanswer-=my_pow*1;\n\tanswer+=(minDig*1)-(LastDig*1);\n\tanswer=answer+subvar2;\n\treturn answer;\n}\n\n\nclient = null;\nif (window.XMLHttpRequest)\n{\n\tvar client=new XMLHttpRequest();\n}\nelse\n{\n\tif (window.ActiveXObject)\n\t{\n\t\tclient = new ActiveXObject(\'MSXML2.XMLHTTP.3.0\');\n\t};\n}\nif (!((!!client)&&(!!Math.pow)&&(!!Math.cos)&&(!![].sort)&&(!![].reverse)))\n{\n\tdocument.write("Not all needed JavaScript methods are supported.");\n\n}\nelse\n{\n\tclient.onreadystatechange = function()\n\t{\n\t\tif(client.readyState == 4)\n\t\t{\n\t\t\tvar MyCookie=client.getResponseHeader("X-AA-Cookie-Value");\n\t\t\tif ((MyCookie == null) || (MyCookie==""))\n\t\t\t{\n\t\t\t\tdocument.write(client.responseText);\n\t\t\t\treturn;\n\t\t\t}\n\t\t\t\n\t\t\tvar cookieName = MyCookie.split(\'=\')[0];\n\t\t\tif (document.cookie.indexOf(cookieName)==-1)\n\t\t\t{\n\t\t\t\tdocument.write(GenericErrorMessageCookies);\n\t\t\t\treturn;\n\t\t\t}\n\t\t\twindow.location.reload(true);\n\t\t}\n\t};\n\ty=test(Challenge);\n\tclient.open("POST",window.location,true);\n\tclient.setRequestHeader(\'X-AA-Challenge-ID\', ChallengeId);\n\tclient.setRequestHeader(\'X-AA-Challenge-Result\',y);\n\tclient.setRequestHeader(\'X-AA-Challenge\',Challenge);\n\tclient.setRequestHeader(\'Content-Type\' , \'text/plain\');\n\tclient.send();\n}\n\n\n\nJavaScript must be enabled in order to view this page.\n\n'

Things I've tried:

  • Swapping requests for urllib.request.urlopen()
  • Adding headers manually
  • Ensuring the same packages are installed
  • Upgrading to a PythonAnywhere premium account

What gives? What strikes me is that requests is supposed to behave the same way on my machine and on theirs.


Solution

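  • Looks like the server you're trying to scrape has protection that tries to make sure a real browser (with a human behind it) is making the request. If you format that response nicely, you'll see a script that computes an answer from the Challenge value at the top and POSTs it back in the X-AA-Challenge, X-AA-Challenge-ID and X-AA-Challenge-Result request headers. requests never runs that script, so all it gets back is the challenge page.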

    I assume the server owners have put the IP addresses that PythonAnywhere uses on a list that triggers this challenge (maybe someone really spammed them from there in the past?), which would explain why the same code works from your own machine.

    Searching for the same headers, I found this project, which seems to have solved the same problem: https://github.com/niryariv/opentaba-server/

    They check for the challenge here: https://github.com/niryariv/opentaba-server/blob/master/lib/mavat_scrape.py#L31 and parse it with this helper: https://github.com/niryariv/opentaba-server/blob/master/lib/helpers.py#L109

    Hope that helps!
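
To make the idea concrete, here is a rough sketch of answering the challenge from Python. It is a port of the test() function in the response from the question, not the code from the linked project, and it reads the x and y lines as (var1*3+subvar1)*1 and Math.cos(Math.PI*subvar2). The function names are made up for illustration. It also assumes that the server sets its cookie through a normal Set-Cookie header on the POST response (so that a requests.Session picks it up) and that one retry is enough; neither has been checked against the real site.

```python
import re

# The two values the challenge page declares at the top of its script.
CHALLENGE_RE = re.compile(r"\bChallenge=(\d+);")
CHALLENGE_ID_RE = re.compile(r"\bChallengeId=(\d+);")


def parse_challenge(text):
    """Return (challenge, challenge_id) as strings, or None if this is not a challenge page."""
    challenge = CHALLENGE_RE.search(text)
    challenge_id = CHALLENGE_ID_RE.search(text)
    if not (challenge and challenge_id):
        return None
    return challenge.group(1), challenge_id.group(1)


def solve_challenge(challenge):
    """Hypothetical Python port of the page's JavaScript test() function."""
    digits = list(challenge)
    last_dig = int(digits[-1])                     # var_arr.reverse()[0]
    digits.sort()                                  # reverse() then sort() leaves the array sorted
    min_dig = int(digits[0])                       # var_arr.sort()[0]
    subvar1 = 2 * int(digits[2]) + int(digits[1])  # numeric addition in the JS
    subvar2 = str(2 * int(digits[2])) + digits[1]  # number + string concatenates in the JS
    my_pow = (int(digits[0]) + 2) ** int(digits[1])
    x = int(challenge) * 3 + subvar1
    # Math.cos(Math.PI * n) is +1 or -1 for an integer n; assumes the float result rounds exactly.
    y = 1 if int(subvar2) % 2 == 0 else -1
    answer = x * y - my_pow + (min_dig - last_dig)
    return str(answer) + subvar2                   # final "answer + subvar2" is a string concat


def fetch_with_challenge(session, url):
    """Fetch url; if a challenge page comes back, answer it and retry once.

    `session` is expected to be a requests.Session so cookies persist between calls.
    """
    response = session.get(url)
    parsed = parse_challenge(response.text)
    if parsed is None:
        return response  # no challenge, this is the real page
    challenge, challenge_id = parsed
    # Mirror the headers the page's XMLHttpRequest sends.
    session.post(url, headers={
        "X-AA-Challenge": challenge,
        "X-AA-Challenge-ID": challenge_id,
        "X-AA-Challenge-Result": solve_challenge(challenge),
        "Content-Type": "text/plain",
    })
    return session.get(url)
```

In gimme_pairs() you would then replace requests.get(url).content with something like fetch_with_challenge(requests.Session(), url).content. For the Challenge value shown in the question, solve_challenge("355121") gives "-106537141". If the server turns out to hand the cookie over only in the X-AA-Cookie-Value response header, you would have to copy it into session.cookies yourself before retrying.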