
DEBUG: Rule at line 3 without any user agent to enforce it on Python Scrapy


I am trying to scrape content from a website using Scrapy's CrawlSpider class, but I am blocked and get the response shown below. I guess the error in the title has to do with my crawler's User-Agent, so I added a custom user-agent middleware, but the response still persists. I need your help and suggestions on how to resolve this.

I didn't consider using Splash because the content and links to be scraped are not rendered with JavaScript.

My Scrapy spider class:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from datetime import datetime
import arrow
import re
import pandas as pd

class GreyhoundSpider(CrawlSpider):
    name = 'greyhound'
    allowed_domains = ['thegreyhoundrecorder.com.au/form-guides/']
    start_urls = ['https://thegreyhoundrecorder.com.au/form-guides//']
    base_url =  'https://thegreyhoundrecorder.com.au'

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//tbody/tr/td[2]/a"), callback='parse_item', follow=True), #//tbody/tr/td[2]/a
    )

    def clean_date(self, dm):
        # Append the current year to the scraped day/month and normalise the format
        year = pd.to_datetime('now').year
        race_date = pd.to_datetime(dm + ' ' + str(year)).strftime('%d/%m/%Y')
        return race_date

    def parse_item(self, response):
        #Field =  response.xpath ("//ul/li[1][@class='nav-item']/a/text()").extract_first() #all fileds
        for race in response.xpath("//div[@class= 'fieldsSingleRace']"):
            title = ''.join(race.xpath(".//div/h1[@class='title']/text()").extract_first())
            Track = title.split('-')[0].strip()
            date = title.split('-')[1].strip()
            final_date = self.clean_date(date)
            race_number = ''.join(race.xpath(".//tr[@id = 'tableHeader']/td[1]/text()").extract())
            num = list(race_number)
            final_race_number = "".join(num[::len(num)-1] )
            Distance = race.xpath("//tr[@id = 'tableHeader']/td[3]/text()").extract()
            TGR_Grade = race.xpath("//tr[@id = 'tableHeader']/td[4]/text()").extract()
        TGR1 = response.xpath("//tbody/tr[1][@class='fieldsTableRow raceTipsRow']//div/span[1]/text()").extract()
        TGR2 = response.xpath("//tbody/tr[1][@class='fieldsTableRow raceTipsRow']//div/span[2]/text()").extract()
        TGR3 = response.xpath("//tbody/tr[1][@class='fieldsTableRow raceTipsRow']//div/span[3]/text()").extract()
        TGR4 = response.xpath("//tbody/tr[1][@class='fieldsTableRow raceTipsRow']//div/span[4]/text()").extract()
        
        yield {
                'Track': Track,
                'Date': final_date,
                '#': final_race_number,
                'Distance': Distance,
                'TGR_Grade': TGR_Grade,
                'TGR1': TGR1,
                'TGR2': TGR2,
                'TGR3': TGR3,
                'TGR4': TGR4,
                'user-agent': response.request.headers.get('User-Agent').decode('utf-8')
              }
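For reference, I run the spider from the project directory like this (the output file name is just an example):

scrapy crawl greyhound -o greyhound_form_guides.csv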

My custom Middleware Class:

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random, logging

class UserAgentRotatorMiddleware(UserAgentMiddleware):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.
    user_agents_list = [
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
        'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0',
        'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.3.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Safari/601.3.9',
        'Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393',
    ]

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        try:
            self.user_agent = random.choice(self.user_agents_list)
            request.headers.setdefault('User-Agent', self.user_agent)
            
        except IndexError:
            logging.error("Couldn't fetch the user agent")

I have also added my custom middleware to DOWNLOADER_MIDDLEWARES in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'greyhound_recorder_website.middlewares.UserAgentRotatorMiddleware': 400,
    
}
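To check that the rotating middleware is actually applied, I open a Scrapy shell from inside the project (so the project settings and middlewares are loaded) and inspect the header on the request that was sent; the URL is just my start URL:

scrapy shell "https://thegreyhoundrecorder.com.au/form-guides/"
>>> request.headers.get('User-Agent')    # should be one of the agents from user_agents_list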

I have also enabled AutoThrottle in settings.py:

AUTOTHROTTLE_ENABLED = True
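For completeness, these are the other throttling settings I could also tune in settings.py; the values below are only illustrative, not ones I have tested against this site:

AUTOTHROTTLE_START_DELAY = 5            # initial download delay
AUTOTHROTTLE_MAX_DELAY = 60             # highest delay to back off to when latency is high
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average number of parallel requests per remote server
DOWNLOAD_DELAY = 2                      # baseline delay between requests to the same site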

Here is the robots.txt of the website.

User-agent: bingbot
Crawl-delay: 10

User-agent: SemrushBot
Disallow: /

User-agent: SemrushBot-SA
Disallow: /

User-agent: Yandex
Disallow: /

User-agent: *
Disallow: /wp-admin/

Spider output on the terminal:

2021-09-24 11:52:06 [scrapy.core.engine] INFO: Spider opened
2021-09-24 11:52:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-09-24 11:52:06 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in C:\Users\8470p\Desktop\web-scraping\greyhound_recorder_website\.scrapy\httpcache
2021-09-24 11:52:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-09-24 11:52:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://thegreyhoundrecorder.com.au/robots.txt> (referer: None) ['cached']
2021-09-24 11:52:06 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2021-09-24 11:52:06 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.
2021-09-24 11:52:06 [protego] DEBUG: Rule at line 3 without any user agent to enforce it on.
2021-09-24 11:52:06 [protego] DEBUG: Rule at line 4 without any user agent to enforce it on.

Solution

  • The major hindrance is allowed_domains. You have to take care with it, otherwise CrawlSpider fails to produce the desired output. Another problem is the trailing // at the end of start_urls, which should be a single /. So instead of

    allowed_domains = ['thegreyhoundrecorder.com.au/form-guides/']

    you have to use the domain name only, as follows:

    allowed_domains = ['thegreyhoundrecorder.com.au']

    Lastly, you can add your real user agent in the settings.py file, and in this case it is better practice to set ROBOTSTXT_OBEY = False.
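
    A minimal sketch of the corrected spider head, assuming the rest of your parsing code stays exactly as posted:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class GreyhoundSpider(CrawlSpider):
        name = 'greyhound'
        allowed_domains = ['thegreyhoundrecorder.com.au']   # domain only, no path
        start_urls = ['https://thegreyhoundrecorder.com.au/form-guides/']   # single trailing slash
        base_url = 'https://thegreyhoundrecorder.com.au'

        rules = (
            Rule(LinkExtractor(restrict_xpaths="//tbody/tr/td[2]/a"),
                 callback='parse_item', follow=True),
        )

    And in settings.py (the user agent string below is only an example of a real browser header, not a required value):

    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'
    ROBOTSTXT_OBEY = False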