I'm looking to scrape historical data from:
https://www.racenet.com.au/results/horse-racing
The history is reached by opening the "Select Date" tab, selecting a date and clicking the "View Results" button.
Interacting with the calendar in this way does not change the URL, so I'm lost as to how to cycle through the calendar, bring up the schedule for a particular date and then access the results. When I select a date from the calendar manually and then "View Source" on the returned page, I don't see the links to the specific races.
Example: select May 11 2021 from the calendar; Mackay (QLD) is the first track listed. Right-clicking on this page, viewing the source and searching for "Mackay" yields no match. Manually clicking the first race, "R1", changes the URL to: https://www.racenet.com.au/results/horse-racing/mackay-20210511/smartstate-rentals-bm65-race-1 which is then fine for me to consume. My problem is the steps involved in cycling through the calendar dates and getting a handle on those race URLs.
I'm hoping there's a solution in Python. Any tips or suggestions on how to solve this would be much appreciated.
Here's a more complete answer that will retrieve all of the horses running in each event at every meeting on the chosen day.
import time

import requests
from bs4 import BeautifulSoup

DATE = "2024-06-07"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0",
    "Accept": "*/*",
    "authorization": "Bearer none",
}

# Persisted GraphQL query that lists the meetings between two dates.
params = {
    "operationName": "meetingsIndexByStartEndDate",
    "variables": '{"startDate": "' + DATE + '", "endDate": "' + DATE + '", "limit": 100}',
    "extensions": '{"persistedQuery": {"version": 1, "sha256Hash": "998212fede87c9261e0f18e9d8ced2ed04a915453dcd64ae1b5cf5a72cf25950"}}',
}

response = requests.get("https://puntapi.com/graphql-horse-racing", params=params, headers=headers)
response.raise_for_status()
races = response.json()

for group in races["data"]["meetingsGrouped"]:
    for meeting in group["meetings"]:
        for event in meeting["events"]:
            # Pause between requests to go easy on the site.
            time.sleep(5)

            print("🟦 " + meeting["name"] + " — " + event["name"] + "\n")

            # The meeting and event slugs combine into the results page URL.
            url = "https://www.racenet.com.au/results/horse-racing/" + meeting["slug"] + "/" + event["slug"]
            print("URL: " + url + "\n")

            page = requests.get(url, headers=headers)
            soup = BeautifulSoup(page.text, "html.parser")

            # Each runner's name sits in an <h4> with this class in the static HTML.
            names = soup.select("h4.selection-result__info-competitor-name")
            for name in names:
                print(name.get_text().strip())

            print()
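To cover the "cycle through the calendar" part of the question, the same request can be repeated once per date. The following is only a sketch: `date_strings` and `build_params` are hypothetical helper names, and it assumes the persisted query hash used above stays valid for other dates. It also builds the JSON strings with `json.dumps` instead of string concatenation, which avoids quoting mistakes.

```python
import json
from datetime import date, timedelta

# Hash copied from the script above; assumed to remain valid for this query.
QUERY_HASH = "998212fede87c9261e0f18e9d8ced2ed04a915453dcd64ae1b5cf5a72cf25950"


def date_strings(start, end):
    """Yield every date from start to end (inclusive) as YYYY-MM-DD."""
    current = start
    while current <= end:
        yield current.isoformat()
        current += timedelta(days=1)


def build_params(day):
    """Build the query-string parameters for one day's meetings."""
    return {
        "operationName": "meetingsIndexByStartEndDate",
        "variables": json.dumps({"startDate": day, "endDate": day, "limit": 100}),
        "extensions": json.dumps(
            {"persistedQuery": {"version": 1, "sha256Hash": QUERY_HASH}}
        ),
    }


if __name__ == "__main__":
    # Show the parameters that would be sent for three consecutive days.
    for day in date_strings(date(2021, 5, 10), date(2021, 5, 12)):
        print(day, build_params(day)["variables"])
```

Each dictionary returned by `build_params` can be passed as `params=` to the `requests.get` call shown above, keeping the pause between requests.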
There's a lot more data in both the API response and the static HTML. You can rummage around in that and find everything you need.
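One way to do that rummaging is to list every key path in the JSON before deciding what to extract. This is a minimal sketch: `key_paths` is a hypothetical helper, and `sample` is a hand-written stand-in that contains only the keys the script above already uses, not a real API response.

```python
def key_paths(node, prefix=""):
    """Return the set of key paths found in nested dicts and lists."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_paths(value, path)
    elif isinstance(node, list):
        # List items share one path, marked with [] after the parent key.
        for item in node:
            paths |= key_paths(item, prefix + "[]")
    return paths


# Stand-in data (an assumption): only the keys used by the script above.
sample = {
    "data": {
        "meetingsGrouped": [
            {
                "meetings": [
                    {
                        "name": "Ipswich",
                        "slug": "ipswich-20240607",
                        "events": [
                            {
                                "name": "Tab Ipswich Cup Tickets On Sale Mdn Plate",
                                "slug": "tab-ipswich-cup-tickets-on-sale-mdn-plate-race-1",
                            }
                        ],
                    }
                ]
            }
        ]
    }
}

if __name__ == "__main__":
    for path in sorted(key_paths(sample)):
        print(path)
```

In the script, calling `key_paths(races)` on the parsed response would list whatever other fields the API returns. The listing shown next is from the main script, not from this helper.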
The output of the script for 2024-06-07 looks like this:
🟦 Ipswich — Tab Ipswich Cup Tickets On Sale Mdn Plate

URL: https://www.racenet.com.au/results/horse-racing/ipswich-20240607/tab-ipswich-cup-tickets-on-sale-mdn-plate-race-1

6. Vermeer
5. Salamancas
9. Luisana
8. Fionte
4. Himeji
7. Cassie's Girl
10. Kaytee Sunnyline
3. Turpin's Torment
2. Shambolic
1. Push Turbo

🟦 Ipswich — Put It On Black Mdn Hcp

URL: https://www.racenet.com.au/results/horse-racing/ipswich-20240607/put-it-on-black-mdn-hcp-race-2

3. Find Your Own
4. Hydros
8. Look It's Lucy
9. Starspangle Planet
7. Literacy
1. Arnie's Army