Tags: python, python-3.x, web-scraping, scrapy

Scrapy. Handling Pagination


I'm using Scrapy to collect data from habermeyer.de. Although it's easy to iterate over categories and products, I can't find the right way to handle pagination. If we inspect the pagination mechanism in a web browser, we see that each time we press the button to view more items, we actually send a POST request with some form data, and the server returns HTML with the next batch of products. Moreover, the required form data is injected into the data-search-params attribute of the button, so it can easily be extracted and parsed as JSON.
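For illustration, extracting and parsing that attribute can be sketched with the standard library alone; the markup, class names, and attribute payload below are made-up stand-ins for what the page embeds:

```python
import json
from html.parser import HTMLParser

# Hypothetical markup for the "view more" button; the real page embeds the
# next request's form data in the data-search-params attribute as JSON.
PAGE = """
<button class="more" data-search-params='{"page": 2, "hitsPerPage": 24, "query": "*"}'>
  Mehr anzeigen
</button>
"""

class ParamExtractor(HTMLParser):
    """Collects the JSON payload from the first button carrying it."""
    def __init__(self):
        super().__init__()
        self.params = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "button" and self.params is None and "data-search-params" in attrs:
            self.params = json.loads(attrs["data-search-params"])

extractor = ParamExtractor()
extractor.feed(PAGE)
print(extractor.params["page"])  # 2
```

In a Scrapy callback the same value would come from something like `response.css("button::attr(data-search-params)").get()` followed by `json.loads()`.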

Let's say we have a category. For the experiment, I copied the form data from Chrome's Developer Tools while interacting with the pagination manually, and pasted it into the script below, which I ran in the Scrapy shell:


from scrapy.http import FormRequest


pagination_api_url = "https://www.habermeyer.de/spielwaren-habermeyer-ek-neuburgdonau/search/navigationasn"
form_data = {
  'factFinderSearchParameters': {
    'filters': [
      {
        'name': 'CategoryPath',
        'substring': False,
        'values': [{'exclude': False, 'type': 'or', 'value': ['Rennbahnen, RC & Modellbau']}]
      }  
    ],
    'hitsPerPage': 24,
    'marketIds': ['400866330'],
    'page': 3,
    'query': '*'
  },
  'useAsn': '0'
}
headers = {
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Origin": "https://www.habermeyer.de",
    "Referer": "https://www.habermeyer.de/spielwaren-habermeyer-ek-neuburgdonau/k/rennbahnen-rc-modellbau",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
}
r = FormRequest(pagination_api_url, formdata=form_data, headers=headers)
# fetch(r)

Note: I had to convert the value of useAsn to str in order to avoid TypeError: to_bytes must receive a str or bytes object, got int.

Though fetching the form request returns HTTP 200, the content of the returned HTML indicates that the search returned no results.

As another experiment, I copied the URL-encoded form data from Chrome's Developer Tools and passed it as the body of a plain POST request (see the code below). This time I received the expected HTML output with the new products:

from scrapy import Request


encoded_form_data = "factFinderSearchParameters=%7B%22filters%22%3A%5B%7B%22name%22%3A%22CategoryPath%22%2C%22substring%22%3Afalse%2C%22values%22%3A%5B%7B%22exclude%22%3Afalse%2C%22type%22%3A%22or%22%2C%22value%22%3A%5B%22Rennbahnen%2C+RC+%26+Modellbau%22%5D%7D%5D%7D%5D%2C%22hitsPerPage%22%3A24%2C%22marketIds%22%3A%5B%22400866330%22%5D%2C%22page%22%3A3%2C%22query%22%3A%22*%22%7D&useAsn=0"
r = Request(pagination_api_url, method="POST", body=encoded_form_data, headers=headers)
# fetch(r)
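Decoding that working payload back into Python objects shows what the server actually expects: the factFinderSearchParameters value is a JSON string (note the lowercase false booleans), not a Python repr of a dict. A quick standard-library check:

```python
import json
from urllib.parse import parse_qsl

encoded_form_data = "factFinderSearchParameters=%7B%22filters%22%3A%5B%7B%22name%22%3A%22CategoryPath%22%2C%22substring%22%3Afalse%2C%22values%22%3A%5B%7B%22exclude%22%3Afalse%2C%22type%22%3A%22or%22%2C%22value%22%3A%5B%22Rennbahnen%2C+RC+%26+Modellbau%22%5D%7D%5D%7D%5D%2C%22hitsPerPage%22%3A24%2C%22marketIds%22%3A%5B%22400866330%22%5D%2C%22page%22%3A3%2C%22query%22%3A%22*%22%7D&useAsn=0"

# Undo the URL encoding, then parse the nested JSON value.
fields = dict(parse_qsl(encoded_form_data))
params = json.loads(fields["factFinderSearchParameters"])

print(fields["useAsn"])                   # 0
print(params["page"])                     # 3
print(params["filters"][0]["substring"])  # False
```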

URL-encoding the initial form data (the Python dict above) doesn't help either, though the request still returns HTTP 200:

from urllib.parse import urlencode


encoded_form_data = urlencode(form_data)
r = Request(pagination_api_url, method="POST", body=encoded_form_data, headers=headers)
# fetch(r)
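A side-by-side comparison with the standard library suggests what goes wrong here: urlencode() falls back to str() for non-string values, so the nested dict is transmitted as a Python repr (single quotes) rather than as JSON. The simplified form_data below is a made-up stand-in for the real payload:

```python
import json
from urllib.parse import urlencode

form_data = {
    "factFinderSearchParameters": {"page": 3, "query": "*"},
    "useAsn": "0",
}

# urlencode() str()-ifies the nested dict: note %27 (single quotes),
# which no JSON parser on the server side will accept.
print(urlencode(form_data))
# factFinderSearchParameters=%7B%27page%27%3A+3%2C+%27query%27%3A+%27%2A%27%7D&useAsn=0

# JSON-encoding the nested value first yields double quotes (%22) instead:
fixed = {
    "factFinderSearchParameters": json.dumps(form_data["factFinderSearchParameters"]),
    "useAsn": "0",
}
print(urlencode(fixed))
# factFinderSearchParameters=%7B%22page%22%3A+3%2C+%22query%22%3A+%22%2A%22%7D&useAsn=0
```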

Python version: 3.10.6
Scrapy version: 2.8.0


Solution

  • This should do it.

    from scrapy.crawler import CrawlerProcess
    import scrapy
    import json
    
    
    class DemoSpider(scrapy.Spider):
        name = 'habermeyer'
        
        pagination_api_url = "https://www.habermeyer.de/spielwaren-habermeyer-ek-neuburgdonau/search/navigationasn"
    
        headers = {
            "X-Requested-With": "XMLHttpRequest",
            "Referer": "https://www.habermeyer.de/spielwaren-habermeyer-ek-neuburgdonau/k/rennbahnen-rc-modellbau",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
        }
        querystring = {
            "filters": [{
                "name": "CategoryPath",
                "substring": "false",
                "values": [{"exclude": "false", "type": "or", "value": ["Rennbahnen, RC & Modellbau"]}],
            }],
            "hitsPerPage": 24,
            "marketIds": ["400866330"],
            "page": 1,
            "query": "*",
        }
        
        form_data = {
            'factFinderSearchParameters': json.dumps(querystring),
            'useAsn': '0'
        }
    
        def start_requests(self):
            yield scrapy.FormRequest(
                self.pagination_api_url, 
                method="POST", 
                formdata=self.form_data, 
                headers=self.headers,
                callback=self.parse
            )
    
        def parse(self, response):
            # stop paginating once a page comes back without any results
            if not response.css(".searchResultInformation"):
                return
    
            for item in response.css(".searchResultInformation::text").getall():
                yield {"title": item.strip()}
    
    
            # advance to the next page and rebuild the form payload
            self.querystring['page'] = self.querystring['page'] + 1

            self.form_data = {
                'factFinderSearchParameters': json.dumps(self.querystring),
                'useAsn': '0'
            }
    
            # the URL never changes between pages, so bypass the dupe filter
            yield scrapy.FormRequest(
                self.pagination_api_url,
                method="POST",
                formdata=self.form_data,
                headers=self.headers,
                callback=self.parse,
                dont_filter=True
            )
    
    
    if __name__ == "__main__":
        process = CrawlerProcess()
        process.crawl(DemoSpider)
        process.start()