I'm using Scrapy to collect data from habermeyer.de. Although it's easy to iterate over categories and products, I can't find the right way to handle pagination. If we inspect the pagination mechanism in a web browser, we see that each time we press the button to load more items, we actually send a POST request with some form data, and the server responds with HTML containing the next batch of products. Moreover, the required form data is embedded in the data-search-params attribute of the button, so it can easily be extracted and parsed as JSON.
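For context, this is a minimal sketch (run in the scrapy shell) of pulling those parameters out of the button; the exact selector is an assumption, any element carrying the attribute will do:
import json
# Assumption: the "load more" button exposes the serialized search parameters
# in its data-search-params attribute, as observed in the browser inspector.
raw_params = response.css('[data-search-params]::attr(data-search-params)').get()
search_params = json.loads(raw_params)  # dict with filters, hitsPerPage, page, ...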
Let's say we have a category. For the experiment, I copied the form data from Chrome's Developer Tools while interacting with the pagination manually, and pasted it into the script below, which I use in the scrapy shell:
from scrapy.http import FormRequest
pagination_api_url = "https://www.habermeyer.de/spielwaren-habermeyer-ek-neuburgdonau/search/navigationasn"
form_data = {
    'factFinderSearchParameters': {
        'filters': [
            {
                'name': 'CategoryPath',
                'substring': False,
                'values': [{'exclude': False, 'type': 'or', 'value': ['Rennbahnen, RC & Modellbau']}]
            }
        ],
        'hitsPerPage': 24,
        'marketIds': ['400866330'],
        'page': 3,
        'query': '*'
    },
    'useAsn': '0'
}
headers = {
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Origin": "https://www.habermeyer.de",
    "Referer": "https://www.habermeyer.de/spielwaren-habermeyer-ek-neuburgdonau/k/rennbahnen-rc-modellbau",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
}
r = FormRequest(pagination_api_url, formdata=form_data, headers=headers)
# fetch(r)
Note: I had to convert the value of useAsn into a str in order to avoid TypeError: to_bytes must receive a str or bytes object, got int.
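That error comes from Scrapy's own conversion helper, so it is easy to reproduce in isolation (assuming the helper still lives in scrapy.utils.python):
from scrapy.utils.python import to_bytes
to_bytes('0')  # b'0' -- str values are fine
to_bytes(0)    # TypeError: to_bytes must receive a str or bytes object, got int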
Though fetching the form request returns HTTP 200, the content of the returned HTML indicates that the search returned no results.
As another experiment, I copied the encoded form data from Chrome's Developer Tools and passed it into a simple POST request (see the code below). As a result, I received the expected HTML output with the new products:
from scrapy import Request
encoded_form_data = "factFinderSearchParameters=%7B%22filters%22%3A%5B%7B%22name%22%3A%22CategoryPath%22%2C%22substring%22%3Afalse%2C%22values%22%3A%5B%7B%22exclude%22%3Afalse%2C%22type%22%3A%22or%22%2C%22value%22%3A%5B%22Rennbahnen%2C+RC+%26+Modellbau%22%5D%7D%5D%7D%5D%2C%22hitsPerPage%22%3A24%2C%22marketIds%22%3A%5B%22400866330%22%5D%2C%22page%22%3A3%2C%22query%22%3A%22*%22%7D&useAsn=0"
r = Request(pagination_api_url, method="POST", body=encoded_form_data, headers=headers)
# fetch(r)
URL-encoding the initial form data dict doesn't help either, though the request also returns HTTP 200:
from urllib.parse import urlencode
encoded_form_data = urlencode(form_data)
r = Request(pagination_api_url, method="POST", body=encoded_form_data, headers=headers)
# fetch(r)
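For what it's worth, printing the body shows why it differs from the payload captured in the browser: urlencode() simply calls str() on the nested dict, so the server receives a percent-encoded Python repr rather than JSON:
from urllib.parse import urlencode
# The nested dict is stringified with Python's repr (single quotes, False/True
# capitalized) instead of being serialized to JSON, unlike the captured
# encoded_form_data above, which contains %22...%22 and lowercase false.
print(urlencode(form_data))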
Python version: 3.10.6
Scrapy version: 2.8.0
This should do it. The values passed to FormRequest via formdata must be strings, so the nested search parameters are serialized with json.dumps first, and useAsn is kept as the string '0':
from scrapy.crawler import CrawlerProcess
import scrapy
import json


class DemoSpider(scrapy.Spider):
    name = 'habermeyer'
    pagination_api_url = "https://www.habermeyer.de/spielwaren-habermeyer-ek-neuburgdonau/search/navigationasn"
    headers = {
        "X-Requested-With": "XMLHttpRequest",
        "Referer": "https://www.habermeyer.de/spielwaren-habermeyer-ek-neuburgdonau/k/rennbahnen-rc-modellbau",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
    }
    # Search parameters mirror the payload captured in the browser; the page
    # number is incremented after every response to walk through the pagination.
    querystring = {
        "filters": [
            {
                "name": "CategoryPath",
                "substring": False,
                "values": [{"exclude": False, "type": "or", "value": ["Rennbahnen, RC & Modellbau"]}],
            }
        ],
        "hitsPerPage": 24,
        "marketIds": ["400866330"],
        "page": 1,
        "query": "*",
    }
    # formdata values must be strings, so the nested parameters are JSON-encoded.
    form_data = {
        'factFinderSearchParameters': json.dumps(querystring),
        'useAsn': '0',
    }

    def start_requests(self):
        yield scrapy.FormRequest(
            self.pagination_api_url,
            method="POST",
            formdata=self.form_data,
            headers=self.headers,
            callback=self.parse,
        )

    def parse(self, response):
        # Stop as soon as a page comes back without any results.
        if not response.css(".searchResultInformation"):
            return
        for item in response.css(".searchResultInformation::text").getall():
            yield {"title": item.strip()}
        # Request the next page: same form data, only the page number changes.
        self.querystring['page'] = self.querystring['page'] + 1
        self.form_data = {
            'factFinderSearchParameters': json.dumps(self.querystring),
            'useAsn': '0',
        }
        yield scrapy.FormRequest(
            self.pagination_api_url,
            method="POST",
            formdata=self.form_data,
            headers=self.headers,
            callback=self.parse,
            dont_filter=True,
        )


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(DemoSpider)
    process.start()
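To quickly inspect the scraped titles, you can hand feed-export settings to the process, e.g. CrawlerProcess(settings={"FEEDS": {"titles.json": {"format": "json"}}}), and then save the script and run it directly with python habermeyer_spider.py (both the feed filename and the script name are just examples).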