Tags: python, python-3.x, web-scraping, scrapy

Scrapy. Handling Pagination


I'm using Scrapy to collect data from habermeyer.de. Although it's easy to iterate over categories and products, I can't find the right way to handle pagination. If we inspect the pagination mechanism in a web browser, we see that each time we press the button to view more items, we actually send a POST request with some form data, and the server returns HTML with the next batch of products. Moreover, the required form data is injected into the data-search-params attribute of the button, so it can easily be extracted and parsed as JSON.
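For illustration, extracting and parsing that attribute can be sketched with the standard library alone; the markup, class names, and attribute payload below are made-up stand-ins for what the page embeds:

```python
import json
from html.parser import HTMLParser

# Hypothetical markup for the "view more" button; the real page embeds the
# next request's form data in the data-search-params attribute as JSON.
PAGE = """
<button class="more" data-search-params='{"page": 2, "hitsPerPage": 24, "query": "*"}'>
  Mehr anzeigen
</button>
"""

class ParamExtractor(HTMLParser):
    """Collects the JSON payload from the first button carrying it."""
    def __init__(self):
        super().__init__()
        self.params = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "button" and self.params is None and "data-search-params" in attrs:
            self.params = json.loads(attrs["data-search-params"])

extractor = ParamExtractor()
extractor.feed(PAGE)
print(extractor.params["page"])  # 2
```

In a Scrapy callback the same value would come from something like `response.css("button::attr(data-search-params)").get()` followed by `json.loads()`.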

Let's say we have a category. For the experiment, I copied the form data from Chrome's Developer Tools while interacting with the pagination manually, and pasted it into the script below, which I ran in the Scrapy shell:


from scrapy.http import FormRequest


pagination_api_url = "https://www.habermeyer.de/spielwaren-habermeyer-ek-neuburgdonau/search/navigationasn"
form_data = {
  'factFinderSearchParameters': {
    'filters': [
      {
        'name': 'CategoryPath',
        'substring': False,
        'values': [{'exclude': False, 'type': 'or', 'value': ['Rennbahnen, RC & Modellbau']}]
      }  
    ],
    'hitsPerPage': 24,
    'marketIds': ['400866330'],
    'page': 3,
    'query': '*'
  },
  'useAsn': '0'
}
headers = {
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Origin": "https://www.habermeyer.de",
    "Referer": "https://www.habermeyer.de/spielwaren-habermeyer-ek-neuburgdonau/k/rennbahnen-rc-modellbau",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
}
r = FormRequest(pagination_api_url, formdata=form_data, headers=headers)
# fetch(r)

Note: I had to convert the value of useAsn to str in order to avoid TypeError: to_bytes must receive a str or bytes object, got int.

Though fetching the form request returns HTTP 200, the content of the returned HTML indicates that the search returned no results.

As another experiment, I copied the URL-encoded form data from Chrome's Developer Tools and passed it as the body of a plain POST request (see the code below). This time I received the expected HTML output with the new products:

from scrapy import Request


encoded_form_data = "factFinderSearchParameters=%7B%22filters%22%3A%5B%7B%22name%22%3A%22CategoryPath%22%2C%22substring%22%3Afalse%2C%22values%22%3A%5B%7B%22exclude%22%3Afalse%2C%22type%22%3A%22or%22%2C%22value%22%3A%5B%22Rennbahnen%2C+RC+%26+Modellbau%22%5D%7D%5D%7D%5D%2C%22hitsPerPage%22%3A24%2C%22marketIds%22%3A%5B%22400866330%22%5D%2C%22page%22%3A3%2C%22query%22%3A%22*%22%7D&useAsn=0"
r = Request(pagination_api_url, method="POST", body=encoded_form_data, headers=headers)
# fetch(r)
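Decoding that working payload back into Python objects shows what the server actually expects: the factFinderSearchParameters value is a JSON string (note the lowercase false booleans), not a Python repr of a dict. A quick standard-library check:

```python
import json
from urllib.parse import parse_qsl

encoded_form_data = "factFinderSearchParameters=%7B%22filters%22%3A%5B%7B%22name%22%3A%22CategoryPath%22%2C%22substring%22%3Afalse%2C%22values%22%3A%5B%7B%22exclude%22%3Afalse%2C%22type%22%3A%22or%22%2C%22value%22%3A%5B%22Rennbahnen%2C+RC+%26+Modellbau%22%5D%7D%5D%7D%5D%2C%22hitsPerPage%22%3A24%2C%22marketIds%22%3A%5B%22400866330%22%5D%2C%22page%22%3A3%2C%22query%22%3A%22*%22%7D&useAsn=0"

# Undo the URL encoding, then parse the nested JSON value.
fields = dict(parse_qsl(encoded_form_data))
params = json.loads(fields["factFinderSearchParameters"])

print(fields["useAsn"])                   # 0
print(params["page"])                     # 3
print(params["filters"][0]["substring"])  # False
```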

URL-encoding the initial form data (the Python dict above) doesn't help either, though the request still returns HTTP 200:

from urllib.parse import urlencode


encoded_form_data = urlencode(form_data)
r = Request(pagination_api_url, method="POST", body=encoded_form_data, headers=headers)
# fetch(r)
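A side-by-side comparison with the standard library suggests what goes wrong here: urlencode() falls back to str() for non-string values, so the nested dict is transmitted as a Python repr (single quotes) rather than as JSON. The simplified form_data below is a made-up stand-in for the real payload:

```python
import json
from urllib.parse import urlencode

form_data = {
    "factFinderSearchParameters": {"page": 3, "query": "*"},
    "useAsn": "0",
}

# urlencode() str()-ifies the nested dict: note %27 (single quotes),
# which no JSON parser on the server side will accept.
print(urlencode(form_data))
# factFinderSearchParameters=%7B%27page%27%3A+3%2C+%27query%27%3A+%27%2A%27%7D&useAsn=0

# JSON-encoding the nested value first yields double quotes (%22) instead:
fixed = {
    "factFinderSearchParameters": json.dumps(form_data["factFinderSearchParameters"]),
    "useAsn": "0",
}
print(urlencode(fixed))
# factFinderSearchParameters=%7B%22page%22%3A+3%2C+%22query%22%3A+%22%2A%22%7D&useAsn=0
```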

Python version: 3.10.6
Scrapy version: 2.8.0


Solution

  • This should do it.

    from scrapy.crawler import CrawlerProcess
    import scrapy
    import json
    
    
    class DemoSpider(scrapy.Spider):
        name = 'habermeyer'
        
        pagination_api_url = "https://www.habermeyer.de/spielwaren-habermeyer-ek-neuburgdonau/search/navigationasn"
    
        headers = {
            "X-Requested-With": "XMLHttpRequest",
            "Referer": "https://www.habermeyer.de/spielwaren-habermeyer-ek-neuburgdonau/k/rennbahnen-rc-modellbau",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
        }
        querystring = {
            "filters": [{
                "name": "CategoryPath",
                "substring": "false",
                "values": [{"exclude": "false", "type": "or", "value": ["Rennbahnen, RC & Modellbau"]}],
            }],
            "hitsPerPage": 24,
            "marketIds": ["400866330"],
            "page": 1,
            "query": "*",
        }
        
        form_data = {
            'factFinderSearchParameters': json.dumps(querystring),
            'useAsn': '0'
        }
    
        def start_requests(self):
            yield scrapy.FormRequest(
                self.pagination_api_url, 
                method="POST", 
                formdata=self.form_data, 
                headers=self.headers,
                callback=self.parse
            )
    
        def parse(self, response):
            # stop paginating once a page comes back without any results
            if not response.css(".searchResultInformation"):
                return
    
            for item in response.css(".searchResultInformation::text").getall():
                yield {"title": item.strip()}
    
    
            # advance to the next page and rebuild the form payload
            self.querystring['page'] = self.querystring['page'] + 1

            self.form_data = {
                'factFinderSearchParameters': json.dumps(self.querystring),
                'useAsn': '0'
            }
    
            # the URL never changes between pages, so bypass the dupe filter
            yield scrapy.FormRequest(
                self.pagination_api_url,
                method="POST",
                formdata=self.form_data,
                headers=self.headers,
                callback=self.parse,
                dont_filter=True
            )
    
    
    if __name__ == "__main__":
        process = CrawlerProcess()
        process.crawl(DemoSpider)
        process.start()