Tags: python, http, web-scraping, https, python-requests

Python Requests GET with Proxy - HTTPS scheme returns expected result but HTTP returns header


When I set any URL to use HTTPS as the scheme (i.e., https://), I get my desired response (the page source), but any http:// URL fails or returns what looks like a header dump, and I don't understand why, since I expect a redirect to the page source. This matters because the URLs I'm processing are sometimes http:// and sometimes https://, and I need both to redirect appropriately.

Attempt #1 - The link is http:// and the proxy is HTTPS-based; however, the result is the same for both http and https proxies. The proxies are public.

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent(browsers=['Edge', 'Chrome', 'Firefox', 'Google'], os='Windows', platforms='desktop')
headers = {
    'Accept': 'application/json',
    'User-Agent': ua.random, # generic user agent
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'Connection': 'keep-alive',
    }

htmlRequest = requests.get("http://link.springer.com/10.1023/A:1012637309336", # Another example link that presents the same behavior - https://ieeexplore.ieee.org/document/10152818/
    headers=headers,
    verify=False, # verify is necessary for the https proxy, or I'll receive a "Cannot set verify_mode to CERT_NONE when check_hostname is enabled" error. Either fix works, but it's not the focus.
    #verify="springer-com-chain.pem", # Alternative fix: this certificate chain file is downloaded directly from the link in the GET request above.
    allow_redirects=True,
    #proxies={"http": "http://3.21.101.158:3128"},
    proxies={"http": "https://204.236.176.61:3128"},
    timeout=30)

print(f"Status Code: {htmlRequest.status_code}")
print(f"URL History: {htmlRequest.history}\n")
soup = BeautifulSoup(htmlRequest.content, 'html.parser')
print(soup.prettify())

Attempt #1 Error

Status Code: 200
URL History: []

REMOTE_ADDR = 13.56.247.133
REMOTE_PORT = 56719
REQUEST_METHOD = GET
REQUEST_URI = http://link.springer.com/10.1023/A:1012637309336
REQUEST_TIME_FLOAT = 1739322382.2113674
REQUEST_TIME = 1739322382
HTTP_HOST = link.springer.com
HTTP_USER-AGENT = Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
HTTP_ACCEPT-ENCODING = gzip, deflate
HTTP_ACCEPT = application/json
HTTP_CONNECTION = keep-alive
HTTP_ACCEPT-LANGUAGE = en-GB,en-US;q=0.9,en;q=0.8

The first line is the status code: we see a 200 response, but the next line shows no redirect history. If I open the same URL in a browser, it automatically redirects to https://. I understand the stacks are different, but what is missing, especially since requests is supposed to handle redirects? What is this header output, and why am I receiving it? I could manually make sure every URL in the GET request is https://, but then I still wouldn't understand why this is an issue.
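For the workaround mentioned above (forcing every URL to https:// before the request), a small helper is enough. This is a sketch using only the standard library; `force_https` is a hypothetical name, not part of requests:

```python
from urllib.parse import urlsplit, urlunsplit

def force_https(url: str) -> str:
    """Rewrite the scheme of an http:// URL to https://,
    leaving host, path, query, and fragment untouched."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)

print(force_https("http://link.springer.com/10.1023/A:1012637309336"))
# https://link.springer.com/10.1023/A:1012637309336
```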

Attempt #2 works, returns the page source (the only change is the URL scheme, from http to https), and shows the various redirects

...
htmlRequest = requests.get("https://link.springer.com/10.1023/A:1012637309336",
...
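Worth noting: part of why Attempt #2 behaves differently may be that the proxy is no longer in play at all. requests picks a proxy by matching the target URL's scheme against the keys of the proxies dict, so with only an "http" key, an https:// request goes out directly, bypassing the proxy. You can check this offline with `requests.utils.select_proxy`:

```python
import requests.utils

proxies = {"http": "https://204.236.176.61:3128"}

# http:// target: the scheme matches the "http" key, so the proxy is selected.
print(requests.utils.select_proxy(
    "http://link.springer.com/10.1023/A:1012637309336", proxies))
# -> https://204.236.176.61:3128

# https:// target: no "https" key, so no proxy is selected (direct connection).
print(requests.utils.select_proxy(
    "https://link.springer.com/10.1023/A:1012637309336", proxies))
# -> None
```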

Thank you kindly and hopefully this is helpful for others!


Solution

  • It looks like the problem is your proxy. The response you're interpreting as headers is the response you'd get by requesting the proxy address itself.

    Change the proxy and the problem will be solved.

    If you do

    htmlRequest = requests.get("https://204.236.176.61:3128", verify=False)
    

    Then you'll get the Attempt #1 response.
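    You can reproduce this locally without depending on any public proxy. The sketch below (standard library plus requests, no network access beyond localhost) stands up a tiny misbehaving "proxy" that ignores the requested target and just echoes the request line back as a 200, which is roughly what that public proxy is doing:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

class EchoHandler(BaseHTTPRequestHandler):
    """Mimics a broken proxy: never forwards, just echoes the request."""
    def do_GET(self):
        # For a plain-HTTP request through a proxy, the client sends the
        # full absolute URL in the request line, so self.path contains it.
        body = (
            f"REQUEST_METHOD = GET\n"
            f"REQUEST_URI = {self.path}\n"
            f"HTTP_HOST = {self.headers.get('Host')}\n"
        ).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence request logging
        pass

server = HTTPServer(("127.0.0.1", 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
proxy = f"http://127.0.0.1:{server.server_address[1]}"

r = requests.get("http://link.springer.com/10.1023/A:1012637309336",
                 proxies={"http": proxy}, timeout=5)
print(r.status_code)  # 200 -- but the request never reached Springer
print(r.text)         # the echoed request, much like the Attempt #1 output
server.shutdown()
```

    The 200 status and empty redirect history make sense here: from requests' point of view, the proxy answered the request successfully, so there is nothing to redirect.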

    I've tried your code with another proxy and it works as expected.

    import requests
    from bs4 import BeautifulSoup
    from fake_useragent import UserAgent
    
    ua = UserAgent(browsers=['Edge', 'Chrome', 'Firefox', 'Google'], os='Windows', platforms='desktop')
    headers = {
        'Accept': 'application/json',
        'User-Agent': ua.random, # generic user agent
        'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
        'Connection': 'keep-alive',
        }
    
    htmlRequest = requests.get("http://link.springer.com/10.1023/A:1012637309336", # Another example link that presents the same behavior - https://ieeexplore.ieee.org/document/10152818/
        headers=headers,
        verify=False, # verify is necessary for the https proxy, or I'll receive a "Cannot set verify_mode to CERT_NONE when check_hostname is enabled" error. Either fix works, but it's not the focus.
        #verify="springer-com-chain.pem", # Alternative fix: this certificate chain file is downloaded directly from the link in the GET request above.
        allow_redirects=True,
        #proxies={"http": "http://3.21.101.158:3128"},
        # proxies={"http": "https://204.236.176.61:3128"},
        proxies={"http": "http://185.162.231.250"},
        timeout=30)
    
    print(f"Status Code: {htmlRequest.status_code}")
    print(f"URL History: {htmlRequest.history}\n")
    soup = BeautifulSoup(htmlRequest.content, 'html.parser')
    print(soup.prettify())