I'm trying to scrape old Reddit (old.reddit.com), but every time I get this error:
>>> response.css('div')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'css'
Am I doing something wrong or can you not scrape the old reddit?
This is the log:
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://old.reddit.com/robots.txt> (referer: None)
2020-11-02 14:56:09 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://old.reddit.com/> from <GET http://old.reddit.com>
2020-11-02 14:56:09 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://old.reddit.com/>
You are getting this error because you received an empty response (None), so you are calling the .css() method on a value that has no such method. You received None instead of the expected response object because Scrapy filtered the request before downloading it.
You can see in this line of your execution log:
2020-11-02 14:56:09 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://old.reddit.com/>
The requested URL is not allowed by the site's robots.txt. You can disable this filter through the ROBOTSTXT_OBEY setting in your project's settings.py. To disable it, use:
ROBOTSTXT_OBEY = False
This will cause your spider to ignore robots.txt for ALL requests (see the ROBOTSTXT_OBEY entry in the Scrapy settings documentation).
However, respecting the robots.txt rules is considered good practice (even an ethical one, some may say) in web scraping. More details are available in the robots.txt standard (the Robots Exclusion Protocol).
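If you would rather keep obeying robots.txt, you can check up front whether a URL is allowed, using only the standard library. The sketch below is an assumption-laden illustration: the robots.txt content is a made-up sample (not the real file served by old.reddit.com), and is_allowed and the "my-bot" user agent are hypothetical names.

```python
from urllib.robotparser import RobotFileParser

# Made-up sample rules for illustration; NOT Reddit's actual robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "my-bot" is a placeholder user agent string.
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-bot", "https://old.reddit.com/"))
```

To check a live site instead, RobotFileParser also offers set_url() and read(), which download the file before can_fetch() is called.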