Tags: python, web-scraping, python-requests, pagination, htmlsession

When scraping content from a website, it seems like '&' in the URL is ignored


I have set up a web scraper and am now trying to add support for pagination. The URL changes as expected when going to the next page: '&page=X' is appended to it, where X is the page number.

When there are no more pages, increasing the page number in the URL does not result in a 404. Instead, a new tag with a certain text is added to the page, and that text is what I am going to use to determine that the function can stop.
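
Roughly what I have in mind for the stop condition (the selector and the marker text below are just placeholders, since I have not pinned down the exact tag yet):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.finn.no/car/used/search.html?model=1.8078.2000555&page=999')

# '.no-results' and the text are placeholders; the real tag and text on the site will differ
marker = r.html.find('.no-results', first=True)
if marker is not None and 'no results' in marker.text.lower():
    print('Reached the last page')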

However, when I pass the URL with '&page=X' (using requests_html's HTMLSession), it returns the content of the first page, as if I had not passed the &page=X parameter at all. I guess this means that HTMLSession either ignores everything after '&', or something else is going on that I don't understand.
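
As a sanity check, something like this should show whether the page parameter actually ends up in the request that gets sent (a sketch, with the page number hard-coded):

from requests_html import HTMLSession

session = HTMLSession()
# let requests build the query string instead of concatenating '&page=X' by hand
r = session.get(
    'https://www.finn.no/car/used/search.html',
    params={'model': '1.8078.2000555', 'page': 2},
)
print(r.url)  # the final URL that was actually requested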

I tried the example from the documentation, but it did not work:

r = session.get('https://reddit.com')
for html in r.html:
  print(html)

<HTML url='https://www.reddit.com/'>
<HTML url='https://www.reddit.com/?count=25&after=t3_81puu5'>
<HTML url='https://www.reddit.com/?count=50&after=t3_81nevg'>
<HTML url='https://www.reddit.com/?count=75&after=t3_81lqtp'>

(PS: It's not Reddit I am trying to scrape.)

URL: https://www.finn.no/car/used/search.html?model=1.8078.2000555, with pages added by appending &page=X.

Can anyone help me out? I am using fake_useragent to generate a random User-Agent header.
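
For reference, the random header is generated roughly like this (a minimal sketch of just the fake_useragent part):

from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}  # a new random User-Agent string on each access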


Solution

  • Yes, as you've noticed, there is no 404 when pagination reaches its end; the page just shows you a random ad.

    However, we can determine that by parsing the data itself. If there are no more ads on the page, then the previous page was the last one.

    I've prepared an example function below. It takes a car model ID as a parameter and fetches all ad URLs from all pages.

    It uses the requests library (not requests_html) to fetch the pages and BeautifulSoup to extract the JSON with the ad data.

    import json
    
    import requests
    from bs4 import BeautifulSoup
    
    HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/111.0.0.0 Safari/537.36'}
    
    
    def find_all_car_urls_for_model(model: str) -> list[str]:
        page_number = 1
        all_car_urls = []
        with requests.Session() as sess:
            # set new session and cookies
            _ = sess.get('https://www.finn.no/', headers=HEADERS)
    
            # iterate over pages
            while True:  
                r = sess.get(
                    'https://www.finn.no/car/used/search.html',
                    params={
                        'page': page_number,
                        'sort': 'PUBLISHED_DESC',
                        'model': model,
                    },
                    headers=HEADERS,
                )
    
                # extract data JSON from page
                soup = BeautifulSoup(r.text, features="html.parser")
                extracted_json = json.loads(soup.find('script', id='__NEXT_DATA__').text)
    
                found_car_urls = [
                    ad['ad_link'] for ad 
                    in extracted_json['props']['pageProps']['search']['docs']
                ]
                if found_car_urls:
                    print(f'Found {len(found_car_urls)} car URLs on page {page_number}')
                    all_car_urls.extend(found_car_urls)
                    page_number += 1
                else:
                    # if no cars were found on page then the search is completed
                    break
    
        return all_car_urls
    
    
    if __name__ == '__main__':
        result = find_all_car_urls_for_model(model='1.8078.2000555')
        print(result)  # 399 cars - the same count as shown on the page in the browser
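
    Note that the page number is passed through the params argument rather than concatenated into the URL string, so requests builds and encodes the query string itself. The initial sess.get('https://www.finn.no/') call is only there to pick up session cookies before the paginated requests start.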