python pandas dataframe web-scraping python-requests

How to scrape a table from a .cgi website into a DataFrame?


I want to scrape tennis data from this page: https://www.tennisabstract.com/cgi-bin/leaders.cgi for an assignment.

I need to use Python libraries in a Jupyter Notebook.

When I try to scrape this .cgi page I am unable to get any data from the table. Is there a way to scrape a .cgi page?

The code I am trying is:

    from io import StringIO

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    url = "https://www.tennisabstract.com/cgi-bin/leaders.cgi"
    response = requests.get(url, headers={"User-Agent": "XY"})

    pd.set_option('display.max_rows', None)

    # Parse the static HTML and collect every <table> element
    soup = BeautifulSoup(response.content, "lxml")
    tables = soup.find_all("table")

    # read_html returns a list with one DataFrame per table it finds
    df = pd.read_html(StringIO(str(tables)))
    df = df[1]

    df

The output I get is shown below. It changes when I use df[0], and df[2] fails, even though the same approach works for other tables on the site's HTML pages:

0 1
0 &nbsp Stats: Serve | Return | Breaks | More
1 nan nan
2 nan nan

Solution

  • The data is loaded and rendered dynamically by JavaScript, so the table is not part of the static response you get from this resource.

    1. You could try to fetch and process the data from https://www.minorleaguesplits.com/tennisabstract/cgi-bin/jsmatches/leadersource.js

    2. You could try to mimic a browser, e.g. with Selenium, and parse the rendered version of the page source.

    Example
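    from io import StringIO

    import pandas as pd
    from selenium import webdriver

    driver = webdriver.Chrome()
    url = 'https://www.tennisabstract.com/cgi-bin/leaders.cgi'
    driver.get(url)

    # page_source holds the HTML after the browser has run the page's JavaScript
    html = driver.page_source
    driver.quit()

    pd.read_html(StringIO(html), attrs={'id': 'matches'})[0]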
    
         Rk   Player                M    M W-L  M W%   SPW    SPW-InP  Aces  Ace%  DFs  DF%   DF/2s  1stIn  1st%   2nd%   2%-InP  Hld%   Pts/SG  PtsL/SG
    0    1    Novak Djokovic [SRB]  58   49-9   84.5%  69.1%  68.4%    436   8.7%  147  2.9%  8.1%   63.9%  76.2%  56.7%  61.6%   87.6%  6.1     1.9
    1    2    Jannik Sinner [ITA]   76   65-11  85.5%  69.1%  68.0%    485   8.3%  137  2.4%  6.0%   60.5%  76.8%  57.2%  60.9%   89.6%  6.1     1.9
    2    3    Carlos Alcaraz [ESP]  76   62-14  81.6%  67.2%  67.3%    319   5.6%  160  2.8%  8.3%   66.1%  72.6%  56.8%  61.9%   85.9%  6.2     2
    ...
    48   49   Zhizhen Zhang [CHN]   50   26-24  52.0%  64.5%  63.3%    340   8.3%  119  2.9%  8.0%   63.9%  72.0%  51.2%  55.6%   80.7%  6.3     2.2
    49   50   Daniel Evans [GBR]    40   16-24  40.0%  63.4%  64.4%    163   5.3%  135  4.4%  10.4%  57.6%  71.8%  52.1%  58.1%   79.2%  6.3     2.3
    50   nan  Average               nan  nan    61.2%  65.7%  64.8%    nan   8.6%  nan  3.3%  9.0%   62.8%  73.7%  52.2%  57.3%   83.2%  6.3     2.2
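
Option 1 (reading the .js source directly) could look roughly like the sketch below. I have not checked how leadersource.js is actually laid out, so this assumes the script contains statements of the form `var name = [...];` whose array literal is valid JSON. The helper names `extract_arrays` and `fetch_frames` are made up for illustration. Print the first few hundred characters of the file first, and adapt the pattern if the layout turns out to be different.

```python
import json
import re

# Assumption (not verified against the real file): the data is stored as
# `var name = [...];` and the array literal happens to be valid JSON.
ARRAY_ASSIGNMENT = re.compile(r"var\s+(\w+)\s*=\s*(\[.*?\])\s*;", re.DOTALL)


def extract_arrays(js_text):
    """Return {variable_name: parsed_list} for each JSON-compatible array."""
    arrays = {}
    for name, literal in ARRAY_ASSIGNMENT.findall(js_text):
        try:
            arrays[name] = json.loads(literal)
        except json.JSONDecodeError:
            # Not valid JSON (e.g. single-quoted strings): skip this variable.
            continue
    return arrays


def fetch_frames(url):
    """Download a script and return one DataFrame per array found in it."""
    # Imported here so extract_arrays has no third-party dependencies.
    import pandas as pd
    import requests

    response = requests.get(url, headers={"User-Agent": "XY"}, timeout=30)
    response.raise_for_status()
    arrays = extract_arrays(response.text)
    return {name: pd.DataFrame(rows) for name, rows in arrays.items()}


# Offline check with made-up data (not taken from the real file).
sample = 'var demo = [["Player A", 10], ["Player B", 7]];\nvar note = "ignored";'
print(extract_arrays(sample))
```

To try it against the real source, call `fetch_frames` with the leadersource.js URL from option 1 and inspect the keys of the returned dict. If the dict is empty, the file does not match the assumed layout and the Selenium route is the safer choice.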