Tags: python, web-scraping, python-requests, pagination, htmlsession

When scraping content from a website, it seems like '&' in the URL is ignored


I have set up a web scraper and am now trying to add support for pagination. The URL changes as expected when going to the next page: '&page=X' is appended to it, where X is the page number.

When there are no more pages, increasing the page number in the URL does not result in a 404. Instead, a new tag with a certain text is added to the page, and that text is what I am going to use to determine that the function can stop.
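
Roughly what I have in mind for the stop condition (the selector and the marker text below are just placeholders, since I have not pinned down the exact tag yet):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.finn.no/car/used/search.html?model=1.8078.2000555&page=999')

# '.no-results' and the text are placeholders; the real tag and text on the site will differ
marker = r.html.find('.no-results', first=True)
if marker is not None and 'no results' in marker.text.lower():
    print('Reached the last page')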

However, when I pass the URL with '&page=X' (using requests_html's HTMLSession), it returns the content of the first page, as if I had not passed the &page=X parameter at all. I guess this means that HTMLSession either ignores everything after '&', or something else is going on that I don't understand.
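
As a sanity check, something like this should show whether the page parameter actually ends up in the request that gets sent (a sketch, with the page number hard-coded):

from requests_html import HTMLSession

session = HTMLSession()
# let requests build the query string instead of concatenating '&page=X' by hand
r = session.get(
    'https://www.finn.no/car/used/search.html',
    params={'model': '1.8078.2000555', 'page': 2},
)
print(r.url)  # the final URL that was actually requested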

I tried the example from the documentation, but it did not work:

r = session.get('https://reddit.com')
for html in r.html:
  print(html)

<HTML url='https://www.reddit.com/'>
<HTML url='https://www.reddit.com/?count=25&after=t3_81puu5'>
<HTML url='https://www.reddit.com/?count=50&after=t3_81nevg'>
<HTML url='https://www.reddit.com/?count=75&after=t3_81lqtp'>

(PS: It's not Reddit I am trying to scrape.)

URL: https://www.finn.no/car/used/search.html?model=1.8078.2000555, with pages added by appending &page=X.

Can anyone help me out? I am using fake_useragent to generate a random User-Agent header.
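
For reference, the random header is generated roughly like this (a minimal sketch of just the fake_useragent part):

from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}  # a new random User-Agent string on each access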


Solution

  • Yes, as you've noticed, there is no 404 when pagination reaches its end; the page just shows you a random ad.

    However, we can determine that by parsing the data itself. If there are no more ads on the page, then the previous page was the last one.

    I've prepared an example function below. It takes a car model ID as a parameter and fetches all ad URLs from all pages.

    It uses the requests library (not requests_html) to fetch the pages and BeautifulSoup to extract the JSON with the ad data.

    import json
    
    import requests
    from bs4 import BeautifulSoup
    
    HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/111.0.0.0 Safari/537.36'}
    
    
    def find_all_car_urls_for_model(model: str) -> list[str]:
        page_number = 1
        all_car_urls = []
        with requests.Session() as sess:
            # set new session and cookies
            _ = sess.get('https://www.finn.no/', headers=HEADERS)
    
            # iterate over pages
            while True:  
                r = sess.get(
                    'https://www.finn.no/car/used/search.html',
                    params={
                        'page': page_number,
                        'sort': 'PUBLISHED_DESC',
                        'model': model,
                    },
                    headers=HEADERS,
                )
    
                # extract data JSON from page
                soup = BeautifulSoup(r.text, features="html.parser")
                extracted_json = json.loads(soup.find('script', id='__NEXT_DATA__').text)
    
                found_car_urls = [
                    ad['ad_link'] for ad 
                    in extracted_json['props']['pageProps']['search']['docs']
                ]
                if found_car_urls:
                    print(f'Found {len(found_car_urls)} car URLs on page {page_number}')
                    all_car_urls.extend(found_car_urls)
                    page_number += 1
                else:
                    # if no cars were found on page then the search is completed
                    break
    
        return all_car_urls
    
    
    if __name__ == '__main__':
        result = find_all_car_urls_for_model(model='1.8078.2000555')
        print(result)  # 399 cars - the same count as shown on the page in the browser
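
    Note that the page number is passed through the params argument rather than concatenated into the URL string, so requests builds and encodes the query string itself. The initial sess.get('https://www.finn.no/') call is only there to pick up session cookies before the paginated requests start.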