javascript python web-scraping beautifulsoup

Scraping a table from a web page


I'm trying to scrape a table from a webpage using Selenium and BeautifulSoup but I'm not sure how to get to the actual data using BeautifulSoup.

webpage: https://leetify.com/app/match-details/5c438e85-c31c-443a-8257-5872d89e548c/details-general

I tried extracting table rows (tag <tr>) but when I call find_all, the array is empty.

When I inspect the page in the browser, I see several <tr> elements. Why don't they show up with BeautifulSoup.find_all()?

Code:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()

driver.get("https://leetify.com/app/match-details/5c438e85-c31c-443a-8257-5872d89e548c/details-general")

html_source = driver.page_source

soup = BeautifulSoup(html_source, 'html.parser')

table = soup.find_all("tbody")
print(len(table))
for entry in table:
    print(entry)
    print("\n")

Solution

  • why don't they show up with BeautifulSoup.find_all() ??

    After a quick glance, it seems the page takes a long time to load.

    The table is filled in by JavaScript after the initial page load, so when you pass driver.page_source to BeautifulSoup straight after driver.get(), the rows are probably not in the HTML yet.

    So, the solution would be to use an explicit wait:

    Wait until page is loaded with Selenium WebDriver for Python
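
    A minimal sketch of the explicit-wait approach, not tested against this site. It assumes the rows can be matched with the CSS selector "tbody tr" (inferred from the question, not verified) and that 20 seconds is long enough. fetch_rendered_rows is a hypothetical helper name:

```python
def fetch_rendered_rows(url, selector="tbody tr", timeout=20):
    """Load url in Chrome, wait for `selector` to appear, return row texts."""
    # Third-party imports live inside the function so the helper can be
    # defined even on a machine where Selenium is not installed.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until at least one matching element is in the DOM;
        # raises TimeoutException after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        # Only now take the snapshot that BeautifulSoup will parse.
        soup = BeautifulSoup(driver.page_source, "html.parser")
    finally:
        driver.quit()

    # One list of cell strings per matched row.
    return [
        [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
        for row in soup.select(selector)
    ]


# Example call (opens a browser window):
# rows = fetch_rendered_rows(
#     "https://leetify.com/app/match-details/"
#     "5c438e85-c31c-443a-8257-5872d89e548c/details-general"
# )
```

    If the wait times out, the selector is probably wrong for this page or the table needs more time, so adjust selector or timeout accordingly.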

    or even (less recommended) a fixed delay before reading driver.page_source:

    from time import sleep
    sleep(10)
    

    That said, I'm not 100% sure, since I don't currently have Selenium installed on my machine.


    However, I'd like to suggest a completely different solution:

    If you look at your browser's network calls (press F12 to open the developer tools, then open the Network tab), you'll see that the data (the table) you're looking for is loaded by sending a GET request to their API:

    The endpoint is under:

    https://api.leetify.com/api/games/5c438e85-c31c-443a-8257-5872d89e548c
    

    which you can view directly from your browser.
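
    The ID in this endpoint is the same one that appears in the match-details page URL from the question. Assuming that pattern holds for other matches (it is inferred from this single example, not from any API documentation), the endpoint can be derived from a page URL. api_url_from_page_url and API_BASE are hypothetical names:

```python
from urllib.parse import urlparse

# Assumed from the single example in this answer, not from API docs.
API_BASE = "https://api.leetify.com/api/games/"


def api_url_from_page_url(page_url):
    """Build the API URL for a match-details page URL."""
    segments = urlparse(page_url).path.strip("/").split("/")
    # Expected path shape: app/match-details/<game-id>/details-general
    if "match-details" not in segments[:-1]:
        raise ValueError(f"no game ID found in {page_url!r}")
    game_id = segments[segments.index("match-details") + 1]
    return API_BASE + game_id


page = (
    "https://leetify.com/app/match-details/"
    "5c438e85-c31c-443a-8257-5872d89e548c/details-general"
)
print(api_url_from_page_url(page))
```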

    So you can use the requests library to make a GET request to that endpoint directly, which will be much more efficient:

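    import requests
    from pprint import pprint

    # Same game ID as in the match-details page URL
    url = 'https://api.leetify.com/api/games/5c438e85-c31c-443a-8257-5872d89e548c'

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on a 4xx/5xx response
    data = response.json()

    pprint(data)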
    

    Prints (truncated):

    {'agents': [{'gameFinishedAt': '2024-07-06T07:10:02.000Z',
                 'gameId': '5c438e85-c31c-443a-8257-5872d89e548c',
                 'id': '63e38340-d1ae-4e19-b51c-e278e3325bbb',
                 'model': 'customplayer_tm_balkan_variantk',
                 'steam64Id': '76561198062922849',
                 'teamNumber': 2},
                {'gameFinishedAt': '2024-07-06T07:10:02.000Z',
                 'gameId': '5c438e85-c31c-443a-8257-5872d89e548c',
                 'id': 'e10f9fc4-759d-493b-a17f-a85db2fcd09d',
                 'model': 'customplayer_ctm_fbi_variantg',
                 'steam64Id': '76561198062922849',
                 'teamNumber': 3},
    

    This approach bypasses the need to wait for the page to load, allowing you to directly access the data.
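
    Once the JSON is parsed, the fields shown in the truncated output can be picked out directly. A small sketch, assuming every entry of 'agents' carries the 'steam64Id', 'teamNumber' and 'model' keys seen above (the full response schema is not known here). agents_by_team is a hypothetical helper, and sample copies two entries from the printed output in place of a live response:

```python
from collections import defaultdict


def agents_by_team(data):
    """Group (steam64Id, model) pairs by teamNumber."""
    teams = defaultdict(list)
    # .get() avoids a KeyError if the response has no 'agents' key.
    for agent in data.get("agents", []):
        teams[agent["teamNumber"]].append((agent["steam64Id"], agent["model"]))
    return dict(teams)


# Two entries copied from the truncated output above, standing in for
# the full response.json() result.
sample = {
    "agents": [
        {"steam64Id": "76561198062922849", "teamNumber": 2,
         "model": "customplayer_tm_balkan_variantk"},
        {"steam64Id": "76561198062922849", "teamNumber": 3,
         "model": "customplayer_ctm_fbi_variantg"},
    ]
}

for team, members in sorted(agents_by_team(sample).items()):
    print(team, members)
```

    With a live response, pass data from response.json() instead of sample.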