Tags: python, scrapy, keyerror

Scrapy: KeyError: 0 when parsing the next page


Set-up

I'm scraping this website for its housing ads, using Scrapy and Python 3.7.

For each housing ad, I obtain the house's characteristics, such as size and price.

The site displays 10 ads per page, so I need to iterate over the pages.


My code

import scrapy

class DwellingSpider(scrapy.Spider):
    name = 'dwelling'

    # df is defined elsewhere; its 'spitogatos_url' column holds the URLs to scrape
    start_urls = list(df['spitogatos_url'])

    def parse(self, response):

        # obtain the ad list element
        result_list = response.css('#searchDetailsListings')

        # select each ad from the list
        for ad in result_list.xpath('div'):

            # here's the code to obtain characteristics per ad and yield them
            pass

        # obtain the next page URL
        next_page = response.css('#pagination > ul > li.next > a').xpath('@href').extract_first()

        # send the next page URL to the parse function
        if len(next_page) > 0:
            yield scrapy.Request(str(next_page), callback=self.parse)

Here, list(df['spitogatos_url']) is the list of URLs I want to scrape. It looks like this:

['https://en.spitogatos.gr/search/results/residential/sale/r194/m194m?ref=refinedSearchSR',
 'https://en.spitogatos.gr/search/results/residential/sale/r153/m153m187m?ref=refinedSearchSR']

Issue

Obtaining the house characteristics per ad works.

The problem is with requesting the next page:

[scrapy.core.scraper] ERROR: Spider error processing <GET https://en.spitogatos.gr/search/results/residential/sale/r194/m194m/offset_10> (referer: https://en.spitogatos.gr/search/results/residential/sale/r194/m194m?ref=refinedSearchSR)
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4730, in get_value
    return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
  File "pandas/_libs/index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 88, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 992, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 998, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0

I'm not sure what causes this KeyError: 0 or how to solve it.

Any ideas?


Edit

I found that if I use a next-page URL as the starting point, i.e.

start_urls = ['https://en.spitogatos.gr/search/results/residential/sale/r177/m177m183m/offset_10']

I immediately get the same error:

ERROR: Spider error processing <GET https://en.spitogatos.gr/search/results/residential/sale/r177/m177m183m/offset_10> (referer: None)
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4730, in get_value
    return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
  File "pandas/_libs/index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 88, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 992, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 998, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0

Solution

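  • Try selecting the next-page link by its rel="next" attribute, and test the result for truthiness instead of calling len() on it, since extract_first() returns None when nothing matches: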

    next_page = response.css('[rel="next"] ::attr(href)').extract_first()
    
    if next_page:
        yield scrapy.Request(str(next_page), callback=self.parse)
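
The truthiness check matters because extract_first() returns None on a page without a next link, and len(None) raises a TypeError. The sketch below isolates that guard in a hypothetical helper, resolve_next_page (not part of Scrapy), using only the standard library. It also resolves relative hrefs against the current page URL, which is an assumption about what the site might return. Inside a spider, response.urljoin(next_page) does the same job.

```python
from urllib.parse import urljoin


def resolve_next_page(current_url, href):
    """Return an absolute next-page URL, or None when there is no next page.

    `href` is whatever extract_first() returned: a string, or None when
    the selector matched nothing (for example on the last page).
    """
    if not href:  # covers both None and the empty string
        return None
    # urljoin leaves absolute hrefs untouched and resolves relative ones
    return urljoin(current_url, href)


current = ('https://en.spitogatos.gr/search/results/residential/sale/'
           'r194/m194m?ref=refinedSearchSR')

no_next = resolve_next_page(current, None)
relative_next = resolve_next_page(
    current, '/search/results/residential/sale/r194/m194m/offset_10')
```

In the spider, the guard would then read: if the helper returns a URL, yield a scrapy.Request for it, otherwise stop paginating.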
    

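
Note that both tracebacks end inside pandas (Int64HashTable.get_item), not inside Scrapy's request handling. That pattern usually means a lookup of label 0 on a Series or DataFrame whose integer index has no 0. The per-ad code is not shown in the question, so this is only an assumption about where the lookup happens. A minimal sketch with made-up data shows the behaviour and two possible workarounds:

```python
import pandas as pd

# Hypothetical data: a Series whose integer index does not contain the label 0,
# as can happen after filtering rows of a DataFrame.
prices = pd.Series([250000, 180000], index=[5, 6])

lookup_failed = False
try:
    prices[0]  # label-based lookup on an integer index: there is no label 0
except KeyError:
    lookup_failed = True

# Positional access does not depend on the index labels.
first_by_position = prices.iloc[0]

# Alternatively, rebuild the index as 0, 1, ... before using label lookups.
first_after_reset = prices.reset_index(drop=True)[0]
```

If the elided code indexes a pandas object with [0], switching to .iloc[0] or resetting the index first may be worth trying alongside the selector change above.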