python pandas web-scraping beautifulsoup playwright

How can I scrape a table for links, click the links, and then scrape the data inside the links?


I am trying to scrape the table from this website, "https://racing.hkjc.com/racing/information/English/racing/RaceCard.aspx?RaceDate=2023/04/06&Racecourse=HV&RaceNo=1", then click each horse name, which leads to a new page, and scrape the tables on that page as well.

This is the code I currently have. It is only a test for the first horse (some of the imports are for later use).

import pandas as pd
import xlsxwriter
from bs4 import BeautifulSoup
from playwright.sync_api import Playwright, sync_playwright, expect
import xlwings as xw


def scrape_ranking(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.click('text="AI ONE"')  # the link that leads to the horse info
        html = page.content()
        browser.close()

    tables = pd.read_html(html)
    df = tables[0]
    df.to_excel("hkjc.xlsx", index=False)

url_1 = ('https://racing.hkjc.com/racing/information/English/racing/RaceCard.aspx?RaceDate=2023/04/06&Racecourse=HV&RaceNo=1')
scrape_ranking(url_1)

This code doesn't crash. However, instead of saving the horse record table, it saves the original table from this website, "https://racing.hkjc.com/racing/information/English/racing/RaceCard.aspx?RaceDate=2023/04/06&Racecourse=HV&RaceNo=1" (the race card).

Is there a way to make the code click the horse name (the link), follow it to the new page (the horse record), and output that table instead?


Solution

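  • Clicking a horse name opens a popup (a separate page) with the horse's details, so `page.content()` still returns the race card. You can combine the examples for handling popups and for waiting for the page to load from the Playwright docs: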

    # ...
    page.goto(url)
    
    with page.expect_popup() as popup_info:
        page.click('text="AI ONE"')
    
    popup = popup_info.value
    popup.wait_for_load_state("domcontentloaded")
    html = popup.content()
    # ...
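
To cover every horse rather than just "AI ONE", the same popup pattern can be run in a loop. The following is a minimal sketch, not a tested scraper for this site. The names `LinkCollector`, `horse_links`, and `scrape_horse_tables` are hypothetical, and the default `href_marker="Horse.aspx"` is an assumption about what the horse links' `href` values contain; inspect the race card's HTML and adjust it if the site differs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (link text, absolute URL) pairs for <a> tags whose href contains a marker."""

    def __init__(self, base_url, href_marker):
        super().__init__()
        self.base_url = base_url
        self.href_marker = href_marker
        self.links = []
        self._href = None  # absolute URL of the matching <a> being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if self.href_marker in href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def horse_links(html, base_url, href_marker="Horse.aspx"):
    # ASSUMPTION: horse links are plain <a> tags whose href contains href_marker.
    parser = LinkCollector(base_url, href_marker)
    parser.feed(html)
    parser.close()
    return parser.links


def scrape_horse_tables(url):
    """Click every horse link on the race card and read the tables from each popup."""
    # Imported here so horse_links() can be used without pandas or Playwright installed.
    from io import StringIO
    import pandas as pd
    from playwright.sync_api import sync_playwright

    results = {}
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        for name, _link in horse_links(page.content(), url):
            # Same popup handling as in the answer, once per horse name.
            with page.expect_popup() as popup_info:
                page.click(f'text="{name}"')
            popup = popup_info.value
            popup.wait_for_load_state("domcontentloaded")
            results[name] = pd.read_html(StringIO(popup.content()))
            popup.close()
        browser.close()
    return results
```

Calling `scrape_horse_tables(url_1)` should then return a dict that maps each horse name to the list of DataFrames found in its popup, which can be written out with `to_excel` as in the question. `horse_links` also returns each link's absolute URL, so if the links turn out to be ordinary URLs, opening them directly with `page.goto` may be a simpler alternative to clicking.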