Tags: web-scraping, scrapy, header

When do I have to set headers and how do I get them?


I am trying to crawl some information from www.blogabet.com.

In the meantime, I am taking a Udemy course on web crawling. The course author already gave me the answer to my problem; however, I do not fully understand why I have to take the specific steps he mentioned. You can find his code below.

I am asking myself:

1. For which websites do I have to use headers?
2. How do I get the information that I have to provide in the header?
3. How do I get the URL he fetches? Basically, I just wanted to fetch: https://blogabet.com/tipsters

Thank you very much :)


scrapy shell

from scrapy import Request
url = 'https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=0'

page = Request(url,
                headers={'Accept': '*/*',
                         'Accept-Encoding': 'gzip, deflate, br',
                         'Accept-Language': 'en-US,en;q=0.9,pl;q=0.8,de;q=0.7',
                         'Connection': 'keep-alive',
                         'Host': 'blogabet.com',
                         'Referer': 'https://blogabet.com/tipsters',
                         'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
                         'X-Requested-With': 'XMLHttpRequest'})

fetch(page)

Solution

  • If you look in your network panel when you load that page, you can see the XHR request and the headers it sends.

    So it looks like he just copied those from the browser.

    In general, you can skip everything except User-Agent, and you should avoid setting the Host, Connection, and Accept headers unless you know what you're doing.