web-scraping scrapy reddit

AttributeError: 'NoneType' object has no attribute 'css'. Trying to scrape old reddit but getting this error


I'm trying to scrape the old reddit but every time I get this error:

>>> response.css('div')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'css'

Am I doing something wrong or can you not scrape the old reddit?

This is the log:

[scrapy.core.engine] DEBUG: Crawled (200) <GET https://old.reddit.com/robots.txt> (referer: None)
2020-11-02 14:56:09 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://old.reddit.com/> from <GET http://old.reddit.com>
2020-11-02 14:56:09 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://old.reddit.com/>

Solution

  • You are getting this error because response is None rather than a response object, so calling .css() on it fails. It is None because Scrapy's robots.txt middleware filtered out the request before it was ever downloaded.

    You can see this in the following line of your log:

    2020-11-02 14:56:09 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://old.reddit.com/>
    

    The requested URL is disallowed by the site's robots.txt. You can turn off this check with the ROBOTSTXT_OBEY setting in your project's settings.py. To disable it, use:

    ROBOTSTXT_OBEY = False
    

    This will cause your spider to ignore robots.txt for ALL requests.

    However, respecting robots.txt rules is considered good practice (some would say an ethical obligation) in web scraping. See the robots.txt standard for more details.
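
If you would rather keep obeying robots.txt, one option is to check whether a URL is allowed before you request it. The following is only a minimal sketch using Python's standard urllib.robotparser module, not Scrapy's own middleware. The rules in EXAMPLE_ROBOTS_TXT, the "my-bot" user agent, the example.com URLs and the is_allowed helper are all made up for illustration; they are not the real contents of old.reddit.com/robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only.
# These are NOT the real rules served by old.reddit.com.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/", "https://example.com/private/page"):
        print(url, is_allowed(EXAMPLE_ROBOTS_TXT, "my-bot", url))
```

To check a live site instead, RobotFileParser also provides set_url() and read() to download the file. Note that the result depends on the user agent you pass, since robots.txt rules can differ per agent.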