Set-up
I'm scraping London housing ads from this site.
One can search for housing ads at three different area levels: the entirety of London, a specific district (e.g. Central London), or a specific sub-district (e.g. Aldgate).
The site only lets you view 50 pages of 30 ads each per area, regardless of the area's size. That is, if I select area X, I can view at most 1,500 ads in X, whether X is Central London or Aldgate.
At the time of writing this question there are over 37,000 ads on the site.
Since I want to scrape as many ads as possible, this limitation means I have to scrape at the sub-district level.
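To make the arithmetic behind that explicit, a quick sketch (the 37,000 figure is the approximate count quoted above):

import math

PAGES_PER_AREA = 50    # pages the site exposes per area
ADS_PER_PAGE = 30      # ads listed on each page
TOTAL_ADS = 37000      # approximate site-wide ad count

cap_per_area = PAGES_PER_AREA * ADS_PER_PAGE     # 1500 ads visible per area
print(cap_per_area)                              # 1500
# searching all of London as a single area covers only ~4% of the ads
print("{:.1%}".format(cap_per_area / TOTAL_ADS)) # 4.1%
# so at least this many areas must be crawled to be able to see every ad
print(math.ceil(TOTAL_ADS / cap_per_area))       # 25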
To do so, I have written the following spider,
import scrapy

# XPath to the area/sub-area links
area_links = ('//*[@id="fullListings"]/div[1]/div/div/nav/aside/'
              'section[1]/div/ul/li/a/@href')


class ApartmentSpider(scrapy.Spider):
    name = 'apartments2'
    start_urls = [
        "https://www.gumtree.com/property-to-rent/london"
    ]

    # obtain links to the London areas
    def parse(self, response):
        for url in response.xpath(area_links).extract():
            yield scrapy.Request(response.urljoin(url),
                                 callback=self.parse_sub_area)

    # obtain links to the London sub-areas
    def parse_sub_area(self, response):
        for url in response.xpath(area_links).extract():
            yield scrapy.Request(response.urljoin(url),
                                 callback=self.parse_ad_overview)

    # obtain the ads on each sub-area page
    def parse_ad_overview(self, response):
        for ads in response.xpath(
                '//*[@id="srp-results"]/div[1]/div/div[2]'
                ).css('ul li a').xpath('@href').extract():
            yield scrapy.Request(response.urljoin(ads),
                                 callback=self.parse_ad)

        next_page = response.css(
            '#srp-results > div.grid-row > div > ul > li.pagination-next > a'
            ).xpath('@href').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    # obtain the info per ad
    def parse_ad(self, response):
        # here follows the code to extract the data per ad
        pass
which works fine.
That is, it first obtains the links to the areas, then the links to the sub-areas within each area, then the links to the individual ads on each sub-area page,
to finally scrape the data from each individual ad.
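For context, a minimal sketch of what parse_ad does (the selectors and field names below are simplified placeholders, not my real extraction code):

    def parse_ad(self, response):
        # placeholder extraction; the real selectors differ
        yield {
            'url': response.url,
            'title': response.css('h1::text').extract_first(),
            'price': response.css('.ad-price::text').extract_first(),  # hypothetical class
        }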
Problem
The code stops scraping at a seemingly random point, and I do not know why.
I suspect it has hit some limit, since it is asked to scrape a great many links and items, but I am not sure whether that is right.
When it stops, it states,
{'downloader/request_bytes': 1295950,
'downloader/request_count': 972,
'downloader/request_method_count/GET': 972,
'downloader/response_bytes': 61697740,
'downloader/response_count': 972,
'downloader/response_status_count/200': 972,
'dupefilter/filtered': 1806,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 9, 4, 17, 13, 35, 53156),
'item_scraped_count': 865,
'log_count/DEBUG': 1839,
'log_count/ERROR': 5,
'log_count/INFO': 11,
'request_depth_max': 2,
'response_received_count': 972,
'scheduler/dequeued': 971,
'scheduler/dequeued/memory': 971,
'scheduler/enqueued': 971,
'scheduler/enqueued/memory': 971,
'spider_exceptions/TypeError': 5,
'start_time': datetime.datetime(2017, 9, 4, 17, 9, 56, 132388)}
I'm not sure whether one can tell from these stats that I've hit a limit, but if anyone can, please let me know whether I did and how to prevent the code from stopping.
A complete, or at least partial, log of the crawling process would help with troubleshooting, but I'm going to take a risk and post this answer anyway, because I see one thing that I assume is the issue:
    def parse_ad_overview(self, response):
        for ads in response.xpath(
                '//*[@id="srp-results"]/div[1]/div/div[2]'
                ).css('ul li a').xpath('@href').extract():
            yield scrapy.Request(response.urljoin(ads),
                                 callback=self.parse_ad)

        next_page = response.css(
            '#srp-results > div.grid-row > div > ul > li.pagination-next > a'
            ).xpath('@href').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
I'm pretty sure I know what's going on; I ran into similar issues in the past. Looking at your script: when you request the next page from that last function, the callback sends it back to parse, but I assume the link to the next page sits on that instance's HTTP response, i.e. on another ad overview page. So just change the callback to parse_ad_overview.
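For concreteness, a sketch of the same pagination block with that one-line fix applied:

    def parse_ad_overview(self, response):
        for ads in response.xpath(
                '//*[@id="srp-results"]/div[1]/div/div[2]'
                ).css('ul li a').xpath('@href').extract():
            yield scrapy.Request(response.urljoin(ads),
                                 callback=self.parse_ad)

        next_page = response.css(
            '#srp-results > div.grid-row > div > ul > li.pagination-next > a'
            ).xpath('@href').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            # the next page is another ad overview page, so it must be
            # handled by this same method, not by parse
            yield scrapy.Request(next_page,
                                 callback=self.parse_ad_overview)

As for the five entries under spider_exceptions/TypeError in your stats: re-running the crawl with Scrapy's --logfile option (e.g. scrapy crawl apartments2 --logfile crawl.log) captures the full log, including the tracebacks, which would pin down where those exceptions come from.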