python, html, web-scraping, beautifulsoup

Web scraping: loading more data into the HTML


I am looking to scrape data from the following web page: https://www.racingpost.com/bloodstock/sales/catalogues/5/2023-12-04

I am using Beautiful Soup and have requested the HTML as shown below. The soup returned contains 50 rows of data, which correspond to the first 50 rows shown when the page first loads. However, I don't know how to access the next 950+ rows that become available when you click the bar below the table. I would like to scrape those hidden rows as well. The regex returns the rows I want as strings: anyone who runs the code below should get 50 strings in the list raw_data, whereas I am looking to get 1065 rows. Any help would be greatly appreciated.

from bs4 import BeautifulSoup as BS
import requests
import re

url = 'https://www.racingpost.com/bloodstock/sales/catalogues/5/2023-12-04'

res = requests.get(url)
print(res.status_code)

soup = BS(res.content, "html.parser")

# The row data is embedded as JSON in the second inline <script> tag
string = str(soup.find_all("script", type="text/javascript")[1])

# Each match is the body of one {"lot...} object, i.e. one row
raw_data = re.findall(r'\{"lot(.*?)\}', string)


Solution

  • You don't need to parse the HTML at all. The table is served as paginated JSON from a data.json endpoint, so just request every page:

    import requests

    # JSON endpoint behind the catalogue table
    url = 'https://www.racingpost.com/bloodstock/sales/catalogues/5/2023-12-04/data.json'

    # The response without a page parameter reports how many pages exist
    page_metadata = requests.get(url).json()
    total_pages = page_metadata['pagination']['totalPages']

    for page in range(1, total_pages + 1):
        response = requests.get(url, params={'page': str(page)})
        data: list[dict] = response.json()['rows']
        print(data)
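
If you want all rows in a single list rather than printed page by page, the loop above can be wrapped roughly as follows. This is only a sketch: the helper names `fetch_page` and `collect_rows` are made up for illustration, and it assumes that the response for `page=1` carries the same `pagination` block as the request without a page parameter, which I have not verified against the site.

```python
from typing import Callable

import requests

BASE_URL = "https://www.racingpost.com/bloodstock/sales/catalogues/5/2023-12-04/data.json"


def fetch_page(page: int, session: requests.Session) -> dict:
    """Fetch one page of the JSON endpoint (hypothetical helper)."""
    response = session.get(BASE_URL, params={"page": page}, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.json()


def collect_rows(fetch: Callable[[int], dict]) -> list:
    """Call fetch(page) for every page and return all rows in page order.

    Assumption: the payload for page 1 contains both 'rows' and
    'pagination' -> 'totalPages', as in the answer's metadata request.
    """
    first = fetch(1)
    total_pages = first["pagination"]["totalPages"]
    rows = list(first["rows"])
    for page in range(2, total_pages + 1):
        rows.extend(fetch(page)["rows"])
    return rows


# Usage (performs real HTTP requests, so it is left commented out):
# with requests.Session() as session:
#     all_rows = collect_rows(lambda page: fetch_page(page, session))
#     print(len(all_rows))
```

Passing the fetch function in as an argument keeps the pagination logic separate from the HTTP call, so it can be checked with canned responses before hitting the site. If the count printed at the end does not match the 1065 rows mentioned in the question, the page 1 assumption is the first thing to check.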