scrapy, scrapy-request

My callback function is not responding; can anyone help out here? I'm really stuck


I tried running this code several times in PyCharm, but it just won't work. The Scrapy Request callback is never called and nothing gets printed. Does anyone have an idea what's causing the bug?

import scrapy


class HemnetSpider(scrapy.Spider):
    name = "hemnet"
    allowed_domains = ["hemnet.se/"]
    start_urls = ["https://www.hemnet.se/bostader?location_ids%5B%5D=17759"]

    def parse(self, response):

        for links in response.css('ul.normal-results > li.normal-results__hit > a::attr("href")'):

            yield scrapy.Request(url=links.get(), callback=self.parseInnerPage)

    def parseInnerPage(self, response):
        print(response.text)

Solution

  • The issue is caused by the value you have in your allowed_domains attribute. This is apparent from the log output Scrapy produces while running your spider; a second way to confirm it is shown at the end of this answer.

    For example, when I run your spider it shows:

    2023-04-29 20:17:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.hemnet.se/bostader?location_ids%5B%5D=17759> (referer: None)
    2023-04-29 20:17:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.hemnet.se': <GET https://www.hemnet.se/bostad/lagenhet-3rum-stoten-malung-salens-kommun-nordklint-57-18902155>
    2023-04-29 20:17:37 [scrapy.core.engine] INFO: Closing spider (finished)
    

    What the above log is saying is that the initial page was crawled successfully, but none of the links you yield new requests for match any of the domains listed in your allowed_domains attribute, so the offsite middleware drops those requests before your callback is ever called (see the short check at the end of this answer).

    This can be solved by either removing the allowed_domains attribute or editing it to ["www.hemnet.se"]. Plain ["hemnet.se"] also works, since subdomains of an allowed domain are matched automatically; what breaks your original value is the trailing slash, because entries must be bare domain names with no scheme, path, or trailing slash.

    For example:

    class HemnetSpider(scrapy.Spider):
        name = "hemnet"
        allowed_domains = ["www.hemnet.se"]
        start_urls = ["https://www.hemnet.se/bostader?location_ids%5B%5D=17759"]
    
        ...
    

    After making the above change and running the spider again, the output prints the full HTML of each listing page, as expected.
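
    To see concretely why "hemnet.se/" never matches, you can reproduce the comparison outside of a spider. The sketch below (my addition, not required for the fix) uses scrapy.utils.url.url_is_from_any_domain, a Scrapy helper that performs the same kind of host-against-domain check that offsite filtering relies on; the URL is one of the filtered listing links from the log above:

    from scrapy.utils.url import url_is_from_any_domain

    # One of the listing URLs that was filtered as offsite in the log above.
    url = "https://www.hemnet.se/bostad/lagenhet-3rum-stoten-malung-salens-kommun-nordklint-57-18902155"

    print(url_is_from_any_domain(url, ["hemnet.se/"]))     # False: the trailing slash can never match a hostname
    print(url_is_from_any_domain(url, ["hemnet.se"]))      # True: www.hemnet.se is a subdomain of hemnet.se
    print(url_is_from_any_domain(url, ["www.hemnet.se"]))  # True: exact host match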
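
    As a side note, if you ever need to confirm that offsite filtering is what's swallowing your requests, the offsite middleware lets through any request whose dont_filter flag is set, so a temporary tweak to the yield line makes the callback fire even with the broken allowed_domains. Keep in mind that dont_filter also disables duplicate filtering, so treat this as a diagnostic, not a fix:

    # Temporary diagnostic only: dont_filter=True bypasses the offsite
    # middleware (and the duplicate filter), so if parseInnerPage suddenly
    # runs, allowed_domains was the culprit.
    yield scrapy.Request(
        url=links.get(),
        callback=self.parseInnerPage,
        dont_filter=True,
    )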