Tags: python, python-requests, get, urllib

Python program times out when hitting this website


Why does this function fail to read XML from "https://www.seattletimes.com/feed/"?

I can visit the URL from my browser just fine. It also reads XML from other websites without a problem ("https://news.ycombinator.com/rss").

import urllib.request


def get_url(url):
    # Send a minimal browser-like User-Agent with the request
    header = {'User-Agent': 'Mozilla/5.0'}
    request = urllib.request.Request(url=url, headers=header)
    response = urllib.request.urlopen(request)
    return response.read().decode('utf-8')

url = 'https://www.seattletimes.com/feed/'

feed = get_url(url)

print(feed)

The program times out every time.

Ideas:

  • Maybe the header needs more info (Accept, etc.)?

EDIT1:

I replaced the request header in the script with my browser's header. Still no-go.

header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36' }
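A side note on the header above: urllib does not decompress responses on its own, so if the server honors 'Accept-Encoding': 'gzip, deflate', calling response.read().decode('utf-8') can fail on compressed bytes. A minimal sketch of handling that is below. The helper name decode_body is hypothetical (not from the original post), and it assumes the body is UTF-8 once decompressed.

```python
import gzip
import zlib


def decode_body(raw, content_encoding=None):
    """Decompress raw bytes according to Content-Encoding, then decode as UTF-8.

    Hypothetical helper for illustration; assumes the payload is UTF-8 text.
    """
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        raw = gzip.decompress(raw)
    elif encoding == "deflate":
        try:
            # Most servers send zlib-wrapped deflate data
            raw = zlib.decompress(raw)
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header
            raw = zlib.decompress(raw, -zlib.MAX_WBITS)
    return raw.decode("utf-8")
```

Inside get_url this would be called as decode_body(response.read(), response.headers.get('Content-Encoding')). Dropping 'Accept-Encoding' from the header avoids the question altogether.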

Solution

  • I am not quite sure why, but the User-Agent header was confusing the website. If you remove the header, your code works just fine. I've tried other header fields without issues, so the User-Agent seems to be what causes that behaviour.

    import urllib.request
    
    
    def get_url(url):
        # No custom headers: urllib sends its default User-Agent
        request = urllib.request.Request(url=url)
        response = urllib.request.urlopen(request)
        return response.read().decode('utf-8')
    
    url = 'https://www.seattletimes.com/feed/'
    
    feed = get_url(url)
    
    print(feed)
    

    After some debugging I found a header combination that works (keep in mind I consider this a bug on their end):

    header = {
        'User-Agent': 'Mozilla/5.0',
        'Cookie': 'PHPSESSID=kfdkdofsdj99g36l443862qeq2',
        'Accept-Language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7',
    }
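To tie the two variants together, here is a rough sketch of a get_url that takes the headers as an optional argument and passes an explicit timeout to urlopen, so a stalled request fails after a few seconds instead of hanging. The names WORKING_HEADERS and build_request and the 10-second default are my own assumptions, not part of the original code, and I can't promise the cookie value is actually required or still accepted.

```python
import urllib.request

# Header combination reported above (assumption: the site may stop
# accepting this cookie value at any time).
WORKING_HEADERS = {
    'User-Agent': 'Mozilla/5.0',
    'Cookie': 'PHPSESSID=kfdkdofsdj99g36l443862qeq2',
    'Accept-Language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7',
}


def build_request(url, headers=None):
    """Build the Request object; with headers=None urllib uses its defaults."""
    return urllib.request.Request(url=url, headers=headers or {})


def get_url(url, headers=None, timeout=10):
    """Fetch url and return the body as text, giving up after `timeout` seconds."""
    request = build_request(url, headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode('utf-8')


# Usage (performs a real network request, so it is left commented out):
# feed = get_url('https://www.seattletimes.com/feed/', headers=WORKING_HEADERS)
# print(feed)
```

Calling get_url(url) with no headers gives the first variant of this answer, and passing WORKING_HEADERS gives the second. If the server stalls, urlopen raises an exception once the timeout expires rather than blocking indefinitely.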