web-scraping · beautifulsoup

Scraping masslottery with beautiful soup for scratch ticket stats


I'm working on scraping the winners and prize amounts from this page with Beautiful Soup: https://www.masslottery.com/tools/winners?games=billion-dollar-extravaganza-2023&page=1. I found a similar Stack Overflow post, but it solves for a different page with different elements, so that solution does not work for my task.

I've inspected the HTML and tried countless possible tags and IDs. Does anyone have advice on accessing the actual table and returning a DataFrame with the prize, date, and location for the given ticket? Thanks in advance!

Here's my code:

from bs4 import BeautifulSoup as bs
import requests
import urllib.request
import json
import pandas as pd
from datetime import datetime as dt

website = "https://www.masslottery.com/tools/winners?games=billion-dollar-extravaganza-2023&page=1"
result = requests.get(website)
content = result.text

soup = bs(content, 'lxml')

# Neither of these lookups finds the table in the downloaded HTML
htmltable = soup.find('table', {'class' : 'multi-col-stacking-table'})
table = soup.find('table', attrs = {'data-title': 'Prize  '})

for tr in table.tbody.find_all('tr'):
    print(tr.text)

I tried the above code and several variations but keep getting None or blank outputs.


Solution

  • You cannot scrape this page with BeautifulSoup alone, because the table is rendered by JavaScript. requests only downloads the static HTML the server sends, and BeautifulSoup simply parses that text, so content the browser loads dynamically never shows up in what you parse.
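
    A quick way to see this for yourself is to check whether the table's class name even appears in the raw HTML that requests downloads; a minimal check along those lines:

    import requests

    # Fetch the static HTML exactly as your original script does
    url = "https://www.masslottery.com/tools/winners?games=billion-dollar-extravaganza-2023&page=1"
    html = requests.get(url).text

    # If this prints False, the table is injected by JavaScript after the page loads,
    # which is why soup.find(...) keeps returning None
    print('multi-col-stacking-table' in html)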

    To handle JavaScript you need a tool like Selenium, which drives a real web browser and executes the page's JavaScript, allowing you to interact with the dynamically loaded content.

    If you have not used Selenium before, you can pip install it like any other major package and look up the setup; it is very simple. I have adjusted your code to use Selenium below:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import pandas as pd
    import time

    driver = webdriver.Chrome()

    driver.get("https://www.masslottery.com/tools/winners?games=billion-dollar-extravaganza-2023&page=1")
    time.sleep(10)  # give the page's JavaScript time to render the table

    table_element = driver.find_element(By.CLASS_NAME, 'multi-col-stacking-table')  # Or any other method to locate your table

    rows = table_element.find_elements(By.TAG_NAME, "tr")

    data = []
    for row in rows:
        cells = row.find_elements(By.TAG_NAME, "td")  # header cells are <th>, so the header row yields []
        row_data = [cell.text for cell in cells]
        data.append(row_data)

    columns = ['Date', 'Amount', 'Game', 'Location']

    # Drop the empty list produced by the header row
    if not data[0]:
        data.pop(0)

    df = pd.DataFrame(data, columns=columns)
    print(df)
    

    This will create a dataframe with one row per winner, containing the Date, Amount, Game, and Location columns defined above.
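
    If the fixed time.sleep(10) ever proves flaky, you could swap it for an explicit wait that returns as soon as the table shows up. A sketch of that variation, dropping into the script above in place of the sleep call (the 15-second timeout is just an example value):

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Wait up to 15 seconds for the table to be present, then continue immediately
    table_element = WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'multi-col-stacking-table'))
    )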

    If you would like to scrape more pages, you can write a for loop over the page parameter in the URL. You need to repeat each part of the process inside the loop, including the sleep, so every page has time to load. Here is the loop skeleton, iterating through the first 10 pages with an f-string; a fuller sketch of the loop body follows it:

    for i in range(1, 11):
        driver.get(f"https://www.masslottery.com/tools/winners?games=billion-dollar-extravaganza-2023&page={i}")