I'm working on scraping the winners and prize amounts from this page: https://www.masslottery.com/tools/winners?games=billion-dollar-extravaganza-2023&page=1 using Beautiful Soup. I found a similar post, but it solves for a different page with different elements, so its solution does not work for my task.
I've inspected the HTML and tried countless tags and IDs. Does anyone have advice on accessing the actual table and returning a DataFrame with the prize, date, and location for the given ticket? Thanks in advance!
Here's my code:
from bs4 import BeautifulSoup as bs
import requests
import urllib.request
import json
import pandas as pd
from datetime import datetime as dt
website = "https://www.masslottery.com/tools/winners?games=billion-dollar-extravaganza-2023&page=1"
result = requests.get(website)
content = result.text
soup = bs(content, 'lxml')
htmltable = soup.find('table', {'class' : 'multi-col-stacking-table'})
#print(prize.prettify())
table = soup.find('table', attrs = {'data-title': 'Prize '})
for tr in table.tbody.find_all('tr'):
    print(tr.text)
I tried the above code and several variations, but I keep getting None or blank output.
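You cannot scrape this page with Beautiful Soup alone because the table is rendered by JavaScript. requests only downloads the page's initial static HTML, and Beautiful Soup only parses that HTML. Neither one executes JavaScript, so the dynamically loaded table never appears in the content you are searching, which is why find returns None.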
To handle JavaScript, you need a more sophisticated tool like Selenium, which controls a web browser and is able to execute JavaScript, allowing you to interact with the dynamically loaded content.
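To illustrate why the static download contains no table, here is a minimal sketch that uses only the standard library so it runs without a network connection. The HTML string is a hypothetical stand-in for what a JavaScript-rendered page might return to requests (it is not the real markup of the lottery page), and `TagCounter` and `count_tags` are names made up for this example. Searching the same string with Beautiful Soup's `soup.find('table')` would return None for the same reason.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the static HTML of a JavaScript-rendered page:
# an empty mount point plus a script tag, and no <table> at all.
STATIC_HTML = """
<html>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.js"></script>
  </body>
</html>
"""


class TagCounter(HTMLParser):
    """Count how often each start tag occurs in the parsed HTML."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html):
    """Return a dict mapping tag name to number of occurrences."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.counts


counts = count_tags(STATIC_HTML)
print("tables found:", counts.get("table", 0))
print("scripts found:", counts.get("script", 0))
```

The table only comes into existence after a browser runs the script, which is the step a plain HTTP download skips.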
If you have not used Selenium before, you can pip install it like any other major package and look up the setup. It is very simple. I have adjusted your code to use Selenium below:
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import time
driver = webdriver.Chrome()
driver.get("https://www.masslottery.com/tools/winners?games=billion-dollar-extravaganza-2023&page=1")
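time.sleep(10)  # give the page's JavaScript time to render the table
table_element = driver.find_element(By.CLASS_NAME, 'multi-col-stacking-table')  # or any other method to locate your table
rows = table_element.find_elements(By.TAG_NAME, "tr")
data = []
for row in rows:
    # Collect the text of every <td> cell in this row
    cells = row.find_elements(By.TAG_NAME, "td")
    row_data = [cell.text for cell in cells]
    data.append(row_data)
columns = ['Date', 'Amount', 'Game', 'Location']
# Drop the first row if it came back empty (a row with no <td> cells, such as a header row)
if data and not data[0]:
    data.pop(0)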
df = pd.DataFrame(data, columns=columns)
print(df)
This will create a DataFrame with the columns Date, Amount, Game, and Location, one row per table row.
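The `data[0]` check in the code above only removes an empty first row. A slightly more defensive variant is sketched below, assuming that any row whose cell count does not match the four columns should be skipped. `rows_to_records` is a hypothetical helper name, and the sample values are made up purely to show the shape of the input.

```python
COLUMNS = ("Date", "Amount", "Game", "Location")


def rows_to_records(rows, columns=COLUMNS):
    """Turn lists of cell texts into dicts, skipping rows that don't fit the columns."""
    records = []
    for cells in rows:
        cleaned = [cell.strip() for cell in cells]
        if len(cleaned) != len(columns):
            # Covers empty rows (no <td> cells) and any malformed rows.
            continue
        records.append(dict(zip(columns, cleaned)))
    return records


# Made-up sample input: one empty row followed by one data row.
sample = [
    [],
    ["01/02/2024", "$1,000", "Example Game", " Example Store, Boston "],
]
print(rows_to_records(sample))
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows_to_records(data))`, which takes the column names from the dict keys.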
If you would like to iterate through more pages, you could write a for loop to iterate through the page URLs. You would need to repeat each part of the process, including sleep, to allow the pages to load. Here is the for loop section, iterating through 10 pages with an f-string:
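for i in range(1, 11):  # the URL in the question starts at page=1
    driver.get(f"https://www.masslottery.com/tools/winners?games=billion-dollar-extravaganza-2023&page={i}")
    time.sleep(10)
    # repeat the table-reading steps here for each page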
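One way to organise the multi-page loop is sketched below, under a few assumptions: the query string keeps the `games` and `page` layout seen in the URL above, a page with no rows means there is nothing more to read, and `load_rows` is a hypothetical callable you supply that opens a URL and returns its rows. `page_url`, `wait_until`, and `scrape_pages` are names invented for this example. `wait_until` polls a condition instead of always sleeping for a fixed time; Selenium also ships its own `WebDriverWait` class for this purpose.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://www.masslottery.com/tools/winners"


def page_url(game, page):
    """Build the winners URL for one page (query layout copied from the URL above)."""
    return f"{BASE_URL}?{urlencode({'games': game, 'page': page})}"


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` until it returns something truthy or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


def scrape_pages(load_rows, game, first_page=1, last_page=10):
    """Collect rows page by page, stopping at the first page that yields none.

    `load_rows` is a hypothetical callable: it takes a URL and returns a list of rows.
    """
    all_rows = []
    for page in range(first_page, last_page + 1):
        rows = load_rows(page_url(game, page))
        if not rows:
            break
        all_rows.extend(rows)
    return all_rows


print(page_url("billion-dollar-extravaganza-2023", 1))
```

With Selenium, `load_rows` could call `driver.get(url)`, then `wait_until(lambda: driver.find_elements(By.CLASS_NAME, 'multi-col-stacking-table'))`, and then read the cells exactly as in the single-page code.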